A  COMPARATIVE  STUDY  OF  SOME  VISUAL 
SPEECH  DISPLAYS 

Bernard Joseph Nordmann Jr., Ph.D.
Department  of  Computer  Science 
University  of  Illinois  at  Urbana-Champaign,  1971 

The  purpose  of  the  present  project  was  to  develop  a  computer 
speech  display  simulation  system  capable  of  generating  a  wide  variety  of 
speech  displays  from  a  recorded  speech  input.   Eventually  it  is  hoped  that 
this  will  lead  to  a  system  whereby  a  person  can  obtain  visual  feedback  as 
a  corrective  measure  for  word  pronunciation.   The  basic  system  would 
involve  two  displays,  one  representing  the  subject's  pronunciation  of  a 
particular  word  and  the  other  representing  a  correct  pronunciation  of  the 
word.   A  computer  would  be  used  to  process  the  incoming  speech  and  produce 
a  display  containing  features  highly  relevant  to  correct  pronunciation. 
The  subject's  task  would  be  to  detect  differences  in  the  two  displays  and 
to  change  his  pronunciation  so  as  to  make  them  more  similar. 

After conducting an extensive literature search to determine the
types of schemes which had previously been used to display speech sounds,
a basic interactive display system was programmed using the CSL's CDC 1604
computer-graphics facility. The system has been designed to be open-ended
and currently can produce photographs of a variety of display types.
Unfortunately, the system as it stands now cannot operate in real time due to
the slowness of the CDC 1604.

The simulation system was used to produce examples of several
different types of displays. These displays were used in a series of
preliminary tests designed to develop techniques for comparing the
effectiveness of various types of displays. Several corrections and
refinements to the testing methods are discussed.
Chapter  1 
INTRODUCTION 

The  purpose  of  this  study  is  to  investigate  several  methods 
for  producing  visual  displays  of  speech  signals.   Visual  speech  displays 
are  generally  used  either  as  speech  analyzers  or  as  speech  recognizers. 
In the first case they can be used to extract a greater or lesser amount
of information from a speech utterance and this information can then be
recorded and compared with displays of other utterances to determine the
types of information which characterize speech. Traditionally, there
have  been  two  separate  approaches  to  speech  display  analysis:   one  which 
attempts  to  determine  a  display  transform  which  will  present  all  the 
information  necessary  to  determine  the  various  phonemes  and  the  other 
which  takes  a  display  of  a  single  type  of  speech  parameter  and  tries  to 
see  how  much  discrimination  can  be  obtained  from  it.   The  former  approach 
has  traditionally  been  followed  by  experimenters  whose  eventual  aim 
was  to  build  a  workable  speech  recognizer  while  the  latter  approach  has 
been  used  by  people  involved  in  speech  therapy  to  help  correct  specific 
speech  problems.   An  additional  distinction  between  the  approaches  is 
that  the  former  have  tended  to  be  much  more  expensive. 

In  the  speech  recognition  type  of  display  utilization,  the 
display  produces  a  visual  image  from  a  sound  input  and  the  viewer  has  to 
decide  what  utterance,  out  of  all  possible  utterances,  is  being  displayed. 
In  the  most  powerful  form  of  this  display,  the  speech  typewriter,  the 
output  would  consist  of  the  typed  version  of  the  word  or  words  spoken. 
It  can  be  argued  that  this  is  not  a  display  but  rather  a  full-fledged 
speech  recognizer.   In  any  case,  we  will  ignore  it  for  the  present. 
In the less powerful forms, this type of display produces an output
image which represents some transformation of the speech input and
which the viewer, possibly only after much practice, is expected to
recognize.

The purpose of the present project is eventually to develop
a display system which can be used as a visual feedback link for
pronunciation. At the most advanced level, we might have a system which would
analyze the user's utterance, compare it with some standard, and then
flash on a "yes" or "no" light. However, this would involve a much
better knowledge of speech and the speech mechanism than is currently
available. It would also provide no information about what was particularly
wrong about the utterance. Thus the purpose of the present project was
to eventually develop a visual display system which would present the
transformed image of the user's utterance along with an image of the
standard. The standard might be an idealized form generated by the
display unit or it could be the version just spoken by an instructor.
In either case it would be the task of the user to correct the image
of his version by repronouncing it until it approached the given standard
to within the appropriate tolerances.

Such  a  system  could  be  used  in  any  situation  in  which  a  person 
requires  a  visual  corrective  feedback  path  to  improve  his  speech.   One 
excellent  example  is  that  of  people  who  have  been  deaf  from  a  very 
early  age.   Because  they  are  unable  to  hear  their  own  voice  or  the  voice 
of  others,  it  is  very  difficult  for  them  to  learn  correct  pronunciation. 
A  visual  feedback  device  would  be  very  helpful  in  such  a  situation.  A 
second  example,  though  not  as  desperately  important,  would  be  in  the 
area  of  foreign  language  teaching  in  which  the  visual  feedback  could  be 
used  as  a  supplement  to  conventional  language  training. 


In  order  to  develop  this  type  of  display  system,  several  steps 
must  be  taken: 

1)  A  suitable  transformation  must  be  found  to  transform  the 
spoken  speech  input  into  some  format  capable  of  being  displayed. 

2)  Depending  on  the  type  of  display  chosen,  tolerances  must 
be  developed  so  that  it  is  possible  to  tell  when  two  spoken  utterances 
are acceptably close.

3)  A  suitable  technique  for  instructing  students  in  the  use 
of  the  display  must  be  developed  since  it  is  doubtful  that  any  of  the 
displays  will  be  suitable  for  use  without  some  period  of  instruction 
and  practice. 

The purpose of this study was to investigate various types
of speech displays, to produce acceptable simulations of several of
these displays using a computer-driven graphics display system, to
develop some type of standardized evaluating procedure for speech
displays, and to apply this standard procedure to certain selected types of
displays.

The  remaining  sections  of  this  report  can  be  read  more  or 
less  independently.   Section  2  is  an  elementary  discussion  of  the 
characteristics  of  speech  with  an  emphasis  on  those  details  which  can 
cause  trouble  in  speech  recognition  and  speech  display  systems.   Section 
3  traces  the  history  of  the  development  of  the  various  types  of  speech 
displays. Section 4 contains a discussion of the simulation, testing,
and  evaluation  procedures  to  be  used  in  the  study. 

Sections 5 and 6 contain, first, a description of the various
displays, and then a summary description of the computer programs used
in the simulation. A more detailed description of each program, including
the listings and various test programs, can be found in Nordmann [1971].

Section  7  discusses  the  results  of  a  preliminary  evaluation 
study  while  section  8  summarizes  the  results  and  conclusions  of  the 
study  and  outlines  further  possible  avenues  of  research. 

Section  9  contains  the  list  of  references  used  in  the  report. 


Chapter  2 
CHARACTERISTICS  OF  SPEECH 
2.1  Problems  in  Speech  Analysis 

Speech  processing  devices  have  long  been  plagued  with  various 
problems  which  result  from  the  characteristics  of  speech  itself  and  from 
the  effects  of  individual  speaker  differences.   As  Liberman,  et  al. [1967a] 
have  explained,  "the  sounds  of  speech  are  a  special  and  especially  efficient 
code  on  the  phonemic  structure  of  language,  not  a  cipher  or  alphabet". 
What  this  means  is  that  the  phonemic  message  being  transmitted  is  highly 
restructured  at  the  level  of  sound.   As  a  result,  the  speech  signal 
characteristics  of  a  given  phonemic  unit  vary  greatly  according  to  context. 

The  basic  biological  reason  for  the  recoding  is  the  fact  that 
both  the  ear  and  the  vocal  articulators  are  slow  speed  devices,  so  that 
in  order  to  deliver  information  at  a  higher  rate,  it  is  necessary  to 
operate  in  parallel  at  both  ends  of  the  communication  channel.   Thus  a 
given  speech  characteristic  will,  in  general,  give  information  about 
more  than  one  phoneme  and  a  given  phoneme  will  be  determined  by  more 
than  one  particular  set  of  speech  characteristics.   Obviously  this 
characteristic  of  speech  greatly  complicates  any  attempts  at  speech 
processing. 

Bobrow  and  Klatt  [1968]  have  discussed  a  variety  of  the 
more  mundane  problems  involved  in  speech  processing.   Some  of  these 
problems  are  as  follows: 

1) The intensity range from one utterance to the next varies
tremendously  due  to  different  amounts  of  vocal  effort  on  the  part  of 
the  speaker  and  the  varying  distance  between  the  person  speaking  and 
the  microphone. 


2) The onset time of an unknown word is not a simple feature
to detect reliably. This is true especially for certain initial
voiceless consonants. It is also fairly difficult to separate the various
phonemes which make up an utterance because the parallel operation of
the speech mechanism does not produce a clear-cut phoneme boundary. The
most successful methods developed so far (e.g. Reddy [1966], Hughes and
Hemdal [1965], Sakai and Doshita [1963], Otten [1964c], etc.) involve
the  establishment  of  certain  parameters  of  the  speech  signal  which  are 
measured  over  extremely  short  periods  of  time.   The  behavior  of  these 
parameters  from  one  time  interval  to  the  next  then  serves  to  establish 
whether  the  particular  interval  is  the  beginning  of  a  new  phoneme  or  a 
continuation  of  the  previous  one. 

3)  The  duration  of  a  word  is  highly  variable.   In  addition,  an 
increase  in  speaking  rate  is  not  accompanied  by  decreasing  the  length  of 
time  for  each  phoneme  by  the  same  proportional  amount.   For  example,  the 
time  needed  to  pronounce  stop  consonants  such  as  "p"  or  "b"  is  not  as 
greatly  affected  by  changes  in  speaking  rate  as  is  the  time  needed  for 
vowels.   Thus  the  time  normalization  problem  is  non-trivial. 

4) Variations in stress and accents can greatly change the
acoustical  properties  of  the  speech  signal. 

5)   Each  speaker  has  a  different  vocal  cavity  configuration  and 
as  a  result,  each  speaker  generates  a  speech  signal  with  a  different 
spectral  configuration. 
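
The interval-by-interval approach mentioned in item 2 can be illustrated with a small sketch. It is not a reconstruction of any of the cited methods; the two parameters (short-time energy and zero-crossing count), the frame length, and both thresholds are assumptions chosen only to make the idea concrete on a synthetic signal.

```python
import math

def frame_parameters(samples, frame_len):
    """Return (energy, zero_crossings) for each consecutive short frame."""
    params = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        params.append((energy, crossings))
    return params

def boundary_frames(params, energy_ratio=4.0, crossing_jump=10):
    """Indices of frames whose parameters differ sharply from the previous frame.

    Both thresholds are illustrative assumptions, not published values.
    """
    boundaries = []
    for i in range(1, len(params)):
        (e0, z0), (e1, z1) = params[i - 1], params[i]
        lo, hi = min(e0, e1), max(e0, e1)
        energy_changed = hi > energy_ratio * max(lo, 1e-12)
        crossings_changed = abs(z1 - z0) >= crossing_jump
        if energy_changed or crossings_changed:
            boundaries.append(i)
    return boundaries

# Synthetic input: silence, then a 200 Hz tone, then a 2000 Hz tone.
RATE = 8000
FRAME = 80  # 10 ms frames
silence = [0.0] * 400
low = [math.sin(2 * math.pi * 200 * n / RATE) for n in range(800)]
high = [math.sin(2 * math.pi * 2000 * n / RATE) for n in range(800)]

params = frame_parameters(silence + low + high, FRAME)
print(boundary_frames(params))
```

On this synthetic input the onset of the first tone shows up as a jump in energy and the change to the second tone as a jump in zero-crossing count. Real speech gives far less clear-cut changes, which is exactly the difficulty described above.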

These  problems,  although  originally  discussed  in  the  context 
of  speech  recognition,  are  also  critical  sources  of  variance  in  speech 
displays. In order to produce an effective display some means must be
found for reducing or normalizing the effects just mentioned and
accentuating the effects which are relevant to distinguishing between
different phonemes and words.

In  the  system  being  proposed  this  will  be  done  by  using  two 
displays  where  the  first  display  is  produced  by  the  subject  and  the 
second is presented as a standard. The task of the subject is to
compare the two displays and to decide in what particulars, if any, they
differ. It is hoped that most of the normalization problems can be
solved by a combination of using the proper physical display and
training the human observer to perform the proper pattern recognition
tasks. After sufficient training the subjects should be capable of
making  the  proper  generalizations  between  two  displays  and  determining 
the  relevant  points  of  difference  and  similarity. 
2.2  Significant  Parameters  of  Speech 

In  order  to  make  the  observers'  task  as  easy  as  possible,  the 
display  should  present  only  those  speech  parameters  which  are  necessary 
for  the  recognition  of  the  speech  itself.   Over  the  past  twenty-five 
years  a  variety  of  research  has  been  carried  out  in  the  search  for 
these  "significant  parameters". 

One  of  the  more  important  features  is  the  frequency  structure 
of  the  speech  wave.   This  structure  typically  peaks  at  three  or  four 
frequencies  due  to  the  resonating  effects  produced  by  the  oral  cavity 
during  the  production  of  speech.   These  peaks  are  called  formants 
(Potter [1947] originally called them "hubs") and are most prominent
during  vowels  and  other  voiced  sounds.   They  are  numbered  beginning 
with  the  lowest  frequency  first. 
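
As a rough illustration of the numbering convention, the following sketch picks the local maxima of a smoothed spectral envelope and labels them F1, F2, ... from the lowest frequency upward. The envelope values, the 250 Hz sampling of the frequency axis, and the `floor` threshold are all hypothetical; they are not measurements from any utterance.

```python
def number_formants(freqs_hz, magnitudes, floor=0.0):
    """Label local maxima of a spectral envelope F1, F2, ... by rising frequency."""
    peaks = []
    for i in range(1, len(magnitudes) - 1):
        is_peak = magnitudes[i - 1] < magnitudes[i] >= magnitudes[i + 1]
        if is_peak and magnitudes[i] > floor:
            peaks.append(freqs_hz[i])
    return {f"F{k}": f for k, f in enumerate(peaks, start=1)}

# Hypothetical envelope sampled every 250 Hz from 250 Hz to 3500 Hz.
freqs = [250 * k for k in range(1, 15)]
envelope = [2, 5, 9, 6, 3, 4, 7, 4, 2, 3, 5, 3, 1, 1]

print(number_formants(freqs, envelope, floor=2))
```

For this made-up envelope the sketch reports three peaks, at 750, 1750 and 2750 Hz. In real speech the envelope must first be estimated from the signal, and weak or merged peaks make the labeling much less straightforward.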

Although  the  absolute  frequency  ranges  of  the  various  formants 
overlap  from  one  speaker  to  another  and  from  one  utterance  to  another 
by the same speaker (Campanella, et al. [1965]), the relative positions
of these formants appear to be important in determining steady state
sounds such as vowels (Potter [1947], Fry [1958]). In particular, it
appears that the relationship of the formant frequencies of a given
vowel to the formant frequencies of the other vowels spoken by the
same speaker is important in the identification of that vowel (Ladefoged
[1957]). Thomas [1966] has also shown that the second formant is the
most important in this respect.

An  even  more  important  feature  appears  to  be  the  transitions 
which  the  formants  make  during  speech.   These  transitions  occur  as  the 
vocal  apparatus  changes  its  configuration  in  order  to  pronounce  the  next 
phoneme  in  a  given  word.   The  Haskins  Laboratories  have  done  a  considerable 
amount  of  work  in  this  area  by  using  a  speech  synthesis  technique,  in 
which  various  formant  structures  are  converted  to  speech,  and  then  checking 
this  synthetic  speech  in  its  similarity  to  real  speech  (DeLattre  et  al. 
[1955], Harris et al. [1958], Liberman [1957], Liberman et al. [1954],
Liberman,  et  al.  [19^8],  etc.).   J.  P.  Radley  [1956]  has  criticized  this 
technique  in  that  it  used  synthetic  speech,  but  when  he  performed 
analyses  of  real  speech,  many  of  his  results  were  similar.   A  summary 
of  the  cues  which  are  useful  in  studying  formant  structure  is  given  in 
Liberman,  et  al.  [1959]. 

In  addition  to  working  on  the  transitions,  Radley  noted  that 
sound  bursts  in  the  high  frequency  region  were  also  important,  especially 
in consonants such as "p", "t", and "k". Halle, et al. [1957] and
Fry  [1958]  have  also  discussed  this  and  Fry  observed  that  it  is  necessary 
to  measure  the  duration  of  the  noise  as  well  as  its  spectral  qualities. 

A different method of characterizing speech has been proposed
by Roman Jakobson, Fant and Halle (see Jakobson, Fant, and Halle [1952]
and Jakobson and Halle [1956]). This method sorts out sounds using
decisions based on the presence or absence of certain distinctive
features such as voicing, nasalization, etc.

Table 1 gives a partial listing of some distinctive features
and their values for certain phonemes. Various authors differ as to what
is included in the list of distinctive features. The list in Table 1
is a composite of several different lists.

The  important  point  as  far  as  speech  recognition  is  concerned 
is  that  the  features  can  be  determined  independently  and  each  has  only 
a  few  possible  values  (usually  only  2).   This  makes  these  features  an 
ideal  analysis  method  since  the  values  of  the  various  features  can  be 
determined  from  the  speech  wave  without  resorting  to  highly  precise 
measurements.
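
The idea of identifying a sound from a few independent two-valued decisions can be sketched as a simple lookup. The three feature names and the five phonemes below are a simplified illustration chosen for this sketch; they are not a transcription of Table 1 and do not follow any one author's feature list.

```python
# Illustrative binary feature values only; NOT the contents of Table 1.
FEATURES = ("voiced", "nasal", "fricative")
PHONEMES = {
    "p": (0, 0, 0),
    "b": (1, 0, 0),
    "m": (1, 1, 0),
    "s": (0, 0, 1),
    "z": (1, 0, 1),
}

def identify(observed):
    """Return the phonemes whose feature values match the observed decisions."""
    return [p for p, values in PHONEMES.items() if values == tuple(observed)]

def differing_features(a, b):
    """Name the features on which two phonemes disagree."""
    return [
        name
        for name, x, y in zip(FEATURES, PHONEMES[a], PHONEMES[b])
        if x != y
    ]

print(identify((1, 1, 0)))           # voiced, nasal, not fricative
print(differing_features("p", "b"))  # the pair differs in one feature
```

Each decision is coarse (present or absent), which is why no highly precise measurement of the speech wave is needed; the difficulty lies in making each binary decision reliably.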

In  general  the  use  of  distinctive  features  has  been  somewhat 
successful  in  speech  recognition  (Hughes  [1961]  and  Hughes  and  Hemdal 
[1965])  but  has  found  only  limited  use  in  speech  displays.   This  latter 
fact  may  be  due  to  the  difficulty  of  producing  an  adequate  display  of 
8  or  10  variables.   In  the  one  example  known  to  this  author  (Upton  [1968]) 
the  display  was  specifically  designed  as  a  supplement  to  normal  lipreading 
and  as  a  result  only  displayed  those  features  which  were  specifically 
hard  to  see  from  lip  movements  alone. 

In  addition  to  the  various  types  of  information  already  mentioned, 
there  are  other  types  of  speech  parameters  which  might  prove  useful. 
Potter [1945] has suggested that pitch must be shown if a display is to be
used  for  speech  correction.   This  is  certainly  one  of  the  speech  functions 
which  is  most  often  involved  in  attempts  to  correct  the  poor  speech  habits 
of  the  deaf.   Its  significance  in  speech  display  applications  which  do 
not  involve  deaf  subjects  is  probably  not  as  great,  although  it  still 
may  be  of  some  importance. 


Chapter  3 
HISTORY  OF  SPEECH  DISPLAYS 
3.1  Early  Displays 

The  first  devices  which  were  used  to  make  speech  visible  were 
mechanical  in  nature  and  were  used  for  speech  correction  purposes.   Several 
types  were  in  existence  in  the  early  1900's  which  utilized  flames  into 
which  the  subject's  speech  was  directed  by  means  of  hollow  tubes.   The 
successive  waves  of  dense  and  rarified  air  caused  variations  in  the  number 
of  ions  available  to  the  flame  and  consequently  caused  the  flame  to  flicker 
in  a  manner  characteristic  of  the  speech  qualities  of  the  subject.   Abramson 
[1952]  describes  several  of  these  devices  and  how  they  are  used  in  speech 
therapy. 

Characteristically,  these  devices  were  able  to  produce  only  a 
very  gross  display  of  the  speech  and  about  the  only  thing  that  could  be 
determined  from  them  was  the  pitch,  presence  of  nasalization,  or  the 
relative  volume  of  the  speech.   However,  this  is  often  quite  helpful  and 
due  to  the  low  cost  of  these  devices,  some  of  them  are  still  in  use. 

Another  very  early  type  of  display  was  an  ordinary  speech  signal 
(i.e. microphone output) vs. time display. Abramson [1952] and Pronovost
[1947] in their surveys on visual speech aids mention oscillographic
displays but generally these displays do not give much useful information.
Flowers [1916] was able to produce one of these displays in 1916 without
the  use  of  an  oscilloscope  by  using  a  string  galvanometer.  An  arc  lamp 
projected  the  shadow  of  the  galvanometer's  silver-plated  quartz  fiber  onto 
a  perpendicular  slit  behind  which  a  photographic  film  was  moving  perpendicular 
to  the  motion  of  the  string.   When  a  subject  spoke  into  a  microphone  attached 
to the galvanometer, a picture of the speech signal as a function of time
was produced.

There were several other types of devices which were discussed
by  both  Abramson  and  Pronovost  which  have  also  been  called  visual  speech 
aids.   However,  in  many  cases  these  devices  are  quite  passive.   Two  examples 
are the so-called "Lite-O-Letter", a game-like device utilizing a display
of  transparent  letters  which  can  be  lit  by  push  buttons  and  the  "Chromovox" 
(also  described  by  Cavanagh  [1951])  which  involved  a  moving  display  of 
words  and  pictures  to  be  spoken  by  the  deaf  pupil  and  a  series  of  lights 
controlled  by  the  teacher  and  used  for  reinforcement.    Since  these  devices 
depend  entirely  on  the  skill  of  a  speech  therapist  to  judge  the  correctness 
of  the  speech  sound  and  activate  the  proper  indicator,  they  will  not  be 
considered  any  further  here. 
3.2  Spectrographic  Displays 

The  emphasis  on  the  more  modern,  electronic  displays  began  with  a 
Bell Telephone Laboratories project which was started early in 1941. A
device  for  the  visual  translation  of  sound  was  needed  in  order  to  carry  on 
some  special  studies  in  speech  distortion  which  were  part  of  the  war  effort. 
Once  the  needs  of  the  military  had  been  accomplished,  however,  it  became 
possible  to  work  on  the  device  with  the  view  of  producing  a  form  of  "visual 
hearing". 

The  device  itself  was  called  "the  sound  spectrograph"  and  produced 
a  three-dimensional  representation  of  the  speech  signal  in  which  time  was 
plotted  on  the  horizontal  axis  and  frequency  on  the  vertical  axis  with  the 
intensity  of  the  particular  frequency  component  at  a  given  time  being 
represented  by  the  intensity  of  the  display  at  that  point.   Later  a  variety 
of  displays  were  developed  using  three-dimensional  formats  with  the  time 
dimension  being  represented  along  the  horizontal  axis.   For  the  remainder 
of  this  paper  this  display  format  type  will  be  referred  to  as  a  linear 
time  display. 
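
The time, frequency, and intensity format can be sketched digitally as a sequence of short-time spectra. This is only a modern illustration of the display format, not a model of the original analog instrument; the sampling rate, frame length, and test tones are assumptions, and a plain discrete Fourier transform stands in for the filter analysis.

```python
import cmath
import math

def spectrogram(samples, frame_len, n_bins):
    """Return rows[t][k]: intensity of frequency bin k during time frame t."""
    rows = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        row = []
        for k in range(n_bins):
            acc = sum(
                s * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n, s in enumerate(frame)
            )
            row.append(abs(acc) / frame_len)
        rows.append(row)
    return rows

RATE = 8000
FRAME = 80  # bin spacing is RATE / FRAME = 100 Hz
tone_a = [math.sin(2 * math.pi * 500 * n / RATE) for n in range(160)]
tone_b = [math.sin(2 * math.pi * 1500 * n / RATE) for n in range(160)]
rows = spectrogram(tone_a + tone_b, FRAME, n_bins=FRAME // 2)

# Frequency (Hz) of the most intense bin in each time frame.
strongest = [max(range(len(r)), key=r.__getitem__) * RATE // FRAME for r in rows]
print(strongest)
```

Each row corresponds to one position along the horizontal (time) axis and each column to one height on the vertical (frequency) axis; the stored magnitude is what the spectrograph renders as darkness or brightness.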

The first published reports of spectrographic linear time
displays began to appear as soon as the war ended and for several years
thereafter (Kopp [1946], Peterson [1954], Potter [1946], Riesz and Schott
[1946] and Steinburg and French [1946]). There were actually several
different types. One of the first types (Koenig, et al. [1946]) produced
a permanent record by repeatedly analyzing the speech signal with a
variable center frequency filter and displaying the rectified filter
output on a piece of paper by means of a variable intensity stylus.
Another model (Dudley and Gruenz [1946]) used a moving phosphor belt and
parallel filters to display the signal in real time. Still a third
(Mathes et al. [1949], Johnson [1946]) used a magnetic disk and CRT system
which recorded the signal and then replayed it many times at very high
speed using a variable filter to give a rapid CRT display.

In 1947, Potter, Kopp and Green published the first edition of
their book, Visible Speech [1947], which described the work they had done
at Bell Laboratories. They had attempted to teach people to read the
spectrograms they had produced much as one would read a book.

They began with a group of five young women in the fall of 1943.
The instruction schedule called for two hours of group instruction and
one hour of individual study each day. The following year four more young
women were added to the group and also a male electrical engineer who was
congenitally deaf.

The learning rate for the newcomers to the group was about 3-1/
words per hour of study. The engineer eventually achieved a vocabulary of
800 words. The four female newcomers achieved between 100 and 300 words
but they had not practiced as long. Within the limits of their vocabularies,
the visible speech class members were able to converse by enunciating
clearly and at a fairly slow rate. Potter remarked that intelligibility
was roughly equivalent to a very noisy telephone connection.

Later  on  the  original  Visible  Speech  Translator  was  moved  to 
the Detroit School for the Deaf, where Kopp and Kopp [1963a, 1963b] used
it to teach speech intonation and stress to deaf children. Similar
versions based on its design were fabricated at other locations as
well (e.g. House, et al. [1968]). In 1965-1966 a transistorized
version  of  the  translator  was  produced  at  Bell  Telephone  Laboratories. 
Stark,  et  al.  [1968]  have  reported  on  its  use  as  a  training  aid  for  deaf 
subjects.   They  found  that  especially  in  the  case  of  younger  subjects, 
the  display  was  of  significant  help  but  that  supplemental  speech 
instruction  was  also  necessary. 

As  interest  in  speech  spectrograms  grew,  various  other  groups 
designed  devices  for  producing  them.   The  Haskins  Laboratory  began  speech 
investigations  using  synthetic  spectrograms  and  a  "pattern  playback" 
device  which  was  a  "spectrograph"  in  reverse.   Ramaswamy  [1962],  Harris 
and Waite [1963], Presti [1957, 1966] and many others developed
spectrographs of varying speeds. However, they all produced the same general
type  of  display,  differing  only  in  the  way  the  display  was  produced. 
3.3  Spectrographic  Variations 

Unfortunately there were several problems with the sound
spectrograph. In addition to the poor overall quality of transmission, one of the
major problems was that some of the more important features which were
necessary for distinguishing between different words were not always easy
to see on the display. Therefore, as time went by, various improvements
were attempted.

Koenig and Ruppel [1948] describe several methods for increasing
the  visible  dynamic  range  of  the  spectrogram.   One  method  involved  using  a 
dot  display  where  the  density  of  the  dots  represented  the  intensity  of  the 
specific  frequency  component.   Another  method,  which  was  also  described 
by  Prestigiacomo  [1962],  used  contours  to  display  the  intensity.   A  third 
method which was further elaborated by Kersta [1948] involved reducing the
spectrogram  to  a  frequency  vs.  frequency  magnitude  plot  only  for  specific 
instants  of  time.   This  allows  the  frequency  distribution  to  be  shown  in 
more  detail  but  drastically  restricts  the  number  of  time  intervals  shown. 

Another  modification  was  one  by  Kock  and  Miller  [1952]  in  which 
a  differentiated  version  of  the  spectrogram  was  used.   The  display  involved 
the  differentiation  of  the  time-amplitude  pattern  for  different  points  on 
the  spectrum.   The  advantage  claimed  for  this  method  was  that  rapid  changes 
in spectrum content, which tend to contain the most phonemic information,
show  up  more  easily. 
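
The effect of differentiating along the time axis can be sketched with a discrete first difference applied to each frequency channel. This is an assumption-laden stand-in for the original analog differentiation, and the small three-channel spectrogram below is hypothetical.

```python
def differentiate_in_time(rows):
    """First difference along time for every frequency channel.

    rows[t][k] is the intensity of channel k at frame t; the result has
    one fewer frame than the input.
    """
    return [
        [cur - prev for prev, cur in zip(rows[t - 1], rows[t])]
        for t in range(1, len(rows))
    ]

# Hypothetical spectrogram: 5 time frames by 3 frequency channels.
# Energy sits in channel 0 for three frames, then moves to channel 1.
rows = [
    [4, 0, 0],
    [4, 0, 0],
    [4, 0, 0],
    [0, 4, 0],
    [0, 4, 0],
]

diff = differentiate_in_time(rows)
for frame in diff:
    print(frame)
```

The steady portions differentiate to zero and drop out of the picture, while the single frame in which the energy moves from one channel to another stands out, which is the advantage claimed for the method.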

D. E. Wood and T. L. Hewitt have described another modification
[1963, 1964] in which a real time spectrograph was used to display just the
peaks of the spectral cross sections. This eliminated the need for intensity
modulation of the visual display. This display, as does the Kock and Miller
display, emphasizes the formant frequency excursions since it is not cluttered
with  as  much  "background"  data.   In  use  as  a  speech  analyzer  this  display 
was  quite  informative.   However,  it  was  not  completely  satisfactory  in  the 
case  of  stop-consonant  bursts  and  other  such  signals. 
3.4  Other  Linear  Time  Displays 

As more work was done with spectrograms their limitations
became increasingly apparent. Although they were a good means of
displaying the detailed information for an analysis of speech, they could
not be read easily or quickly. As a result, several other linear
time displays were tried. These displays used the same format but
processed the speech signals using different techniques in the hope
that they would be easier to "read".

A display by Biddulph [1954] and an earlier one by Bennett
[1953] utilized autocorrelation functions and displayed the delay
parameter, x, vs. time with the magnitude of the autocorrelation
function being shown as the intensity. As it turned out, this display
was actually harder to read than a spectrogram since it became very
sensitive to non-critical information in the speech signal. Huggens
[1954] and Stevens [1950] have each given a detailed analysis and
critique of this method. Huggens shows that slight changes in pitch
may cause large changes in the display. One other undesirable
characteristic of the display was that it was a quadratic function of the
amplitudes of the frequency components and thus a large dominant frequency
could obscure the effects of smaller amplitude frequency components.
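
Both objections can be seen in a small numerical sketch of a short-time autocorrelation. The frame length, sampling rate, and the two test tones are assumptions made for illustration; the sketch is not a model of either published display.

```python
import math

def autocorrelation(frame, max_lag):
    """Autocorrelation of one frame for delays 0..max_lag (in samples)."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag)) / n
        for lag in range(max_lag + 1)
    ]

RATE = 8000
N = 400
# A strong 200 Hz component and a 1000 Hz component one tenth as large.
strong = [math.sin(2 * math.pi * 200 * n / RATE) for n in range(N)]
weak = [0.1 * math.sin(2 * math.pi * 1000 * n / RATE) for n in range(N)]

r_strong = autocorrelation(strong, 40)
r_weak = autocorrelation(weak, 40)

# A 10:1 amplitude ratio becomes roughly a 100:1 ratio in the display.
ratio = r_strong[0] / r_weak[0]
print(round(ratio, 3))

# The largest value at a non-trivial delay sits at the period of the tone,
# so the pattern shifts whenever the pitch period shifts.
best_lag = max(range(20, 41), key=r_strong.__getitem__)
print(best_lag)
```

Because the weaker component contributes only about one hundredth as much, it is easily lost next to the dominant one, and because the ridges of the pattern sit at multiples of the period, a small change in pitch moves the whole pattern.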
3.5  Two-Dimensional  X-Y  Displays 

All  of  the  displays  discussed  so  far  have  been  linear  time 
displays  utilizing  three  display  parameters.   Another  type  of  display 
format  which  has  been  developed  involves  only  two  dimensions  in  which 
time  is  generally  omitted  as  a  direct  display  parameter.   Instead  these 
displays  use  the  chosen  parameters  as  "x"  and  "y"  inputs  in  a  plotter 
(usually  a  CRT)  which  then  plots  the  resulting  point  as  the  parameters 
vary with time. By using time only in this indirect sense, the
originators of these x-y displays hoped to eliminate the effect on their displays
of  varying  time  duration  between  different  utterances  of  the  same  word. 

One  type  of  x-y  display  which  was  developed  utilized  90°  phase 
shifting  circuits.   In  this  type  of  display  the  processing  hardware 
converted the original speech input into two output signals which
were 90° out of phase with one another. Lerner [1952, 1959], Vilbig
[1954], and Barton and Barton [1963] have all described displays of
this type. These displays have been tested by several people but the
results are inconclusive. J. E. Connor [1955] and F. E. Fabian [1955]
evaluated the effectiveness of Lerner's display in speech correction
and claimed that it was just as good as, but no better than, "conventional"
speech therapy in the case of articulation disorders, but of no significant
help in voice improvement. However, in a later preliminary study,
Pronovost [1964] felt that this display showed some promise in improving
the  articulatory  proficiency  of  deaf  children.   Unfortunately,  a 
subsequent  study  (Pronovost,  et  al.  [1968])  was  unable  to  produce  more 
definite  results. 
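
As a rough illustration of the phase-splitting idea described above, the following sketch derives a 90° shifted copy of a sampled signal and pairs it with the original as the x and y inputs of a plot. It is not a model of any of the cited analog circuits: the function name `quadrature_pair` and the use of a discrete Fourier transform (a digital Hilbert transform) in place of phase-shifting networks are assumptions made purely for illustration.

```python
import cmath
import math


def quadrature_pair(signal):
    """Return (x, y): the input signal and a copy shifted by 90 degrees.

    Hypothetical digital stand-in for analog 90-degree phase-shifting
    circuits. Uses a direct O(n^2) DFT, so it is only meant for short
    signals.
    """
    n = len(signal)
    # Forward DFT of the real input.
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Build the analytic signal: double the positive frequencies and
    # zero the negative ones. DC and Nyquist bins are left unchanged.
    for k in range(1, n):
        if 2 * k < n:
            spectrum[k] *= 2
        elif 2 * k > n:
            spectrum[k] = 0
    # Inverse DFT; the imaginary part is the 90-degree shifted signal.
    analytic = [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]
    x = [z.real for z in analytic]
    y = [z.imag for z in analytic]
    return x, y
```

For a pure tone the resulting x-y trace is a circle, while periodic signals containing several harmonics trace more complicated closed curves.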

Pyron and Williamson [1964] gave a critique of Barton and
Barton's apparatus and indicated what they thought was the general
problem with all such techniques, namely that they work best on continuous
sounds (i.e. vowels and nasal consonants) and are very poor on transitions
(i.e. consonants) which carry a high proportion of the speech information.

A different type of x-y display has been developed in Switzerland
by Dreyfus-Graf [1946, 1948, 1950a, 1950b]. This display uses a
system of filters and differentiators to produce pulses which control
the movement of an ink pen. The author claimed that the resulting
squiggles, which do appear fairly consistent for sustained vowels, could
be used as a phonetic shorthand. However, as far as this author knows,
there has been no report on the use of this device with a normal speech
input.

Another x-y display using a CRT has been reported by Plomp,
Pols and Van de Geer [1967]. They analyzed 15 Dutch vowels by using a
bank of 18 filters to process the speech signals and studying the differences
between the vowel spectra. The resulting dimensional analysis
yielded four dimensions which accounted for 96.4% of the total variance
once the between-subject variance had been allowed for. The authors
suggested  using  plots  of  the  first  dimension  vs.  the  second  as  an  aid  for 
the  deaf.   An  oscilloscope  display  for  the  vowels  has  been  produced  but 
work  is  only  beginning  on  the  consonants.   This  method  was  suggested  as  an 
alternative  to  a  type  of  display  in  which  the  frequencies  of  the  first  and 
second  formants  for  various  vowels  are  plotted  as  points  or  regions  on  a 
two-dimensional  graph  (see  for  example  Davis  [1952],  Foulkes  [1961],  Hughes 
[1965] or Majewski [1967]). Although this type of representation is very
appropriate  for  vowel  sounds  it  has  not  met  with  much  success  in  the 
representation  of  consonants.   It  remains  to  be  seen  if  Plomp,  et  al. 
will  be  able  to  apply  their  technique  to  the  consonants. 

Cohen  [1968]  has  described  an  x-y  display  developed  by  Arthur 
D.  Little,  Inc.,  which  was  made  from  a  converted  TV  set.   It  used  a  type 
of  frequency  analysis  somewhat  similar  to  the  cepstrum  analysis  technique 
(Noll [1964, 1967]) in which the log of the output of a spectral analysis
is subjected to another "spectral analysis" to determine "shape" characteristics
of the original spectrum. The ADL display makes use of 10 filter
channels and by the use of various weighting factors resolves their outputs
into sine and cosine components of the frequency spectrum envelope.
These two components are then plotted as the x and y coordinates of the
display. The net result is somewhat analogous to a two-formant display
but the problem of formant identification is avoided. The device is
currently undergoing evaluation for use by deaf people for speech
improvement.
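
One plausible reading of the sine and cosine resolution described above can be sketched as follows. The actual weighting factors of the ADL device are not given here, so this sketch simply assumes that the logarithms of the filter channel outputs are projected onto one full cosine cycle and one full sine cycle spread across the channels. The function name `envelope_xy` and that choice of weights are illustrative assumptions, not the published design.

```python
import math


def envelope_xy(channel_outputs):
    """Map filter-channel outputs to a single (x, y) display point.

    Assumed weighting: channel k of n is weighted by cos(2*pi*k/n) for x
    and sin(2*pi*k/n) for y, applied to the log of the channel output.
    All channel outputs must be positive.
    """
    n = len(channel_outputs)
    logs = [math.log(v) for v in channel_outputs]
    x = sum(level * math.cos(2 * math.pi * k / n) for k, level in enumerate(logs))
    y = sum(level * math.sin(2 * math.pi * k / n) for k, level in enumerate(logs))
    return x, y
```

Under this assumption a perfectly flat spectrum maps to the origin, and the point moves away from the origin as the spectrum envelope becomes more strongly shaped, with its direction depending on where the energy is concentrated.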
3.6  Zero-crossing  Displays 

In  addition  to  classifying  visual  displays  according  to  their 
physical  format,  they  can  be  separated  according  to  the  type  of  processing 
used on the speech signal. Thus we have already discussed spectrographic,
correlation  and  phase  splitting  displays,  among  others.   Another  very 
common  type  of  processing  is  the  extraction  of  zero-crossing  information. 
One  of  the  reasons  this  type  of  processing  is  so  popular  is  that  it  can 
be  easily  performed  using  a  high  gain  amplifier  and  clipping  circuit,  and 
is  thus  cheaply  implemented. 

One  linear  time  display  version  of  a  zero-crossing  display  was 
developed by Chang, et al. [1951b] and further developed by Sakai and
Inoue [1960]. It was called an "intervalgram". This display used the
time intervals between zero-crossings or between zero-slopes (i.e.
zero-crossings of the differentiated signal) as a parameter to be plotted against
time. The display produced a dot for each interval between zero-crossings
where the horizontal position of the dot was determined by the relative
time  position  of  the  interval  and  its  vertical  position  by  the  frequency 
of  the  sinusoidal  signal  which  would  have  produced  an  equivalent  interval 
between  zero-crossings.   The  result  is  a  halftone  display  consisting  of 
dots  which  look  somewhat  similar  to  a  spectrogram. 
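
The dot placement rule just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `intervalgram_points`, the use of linear interpolation to locate each crossing, and the sampled-data setting are not taken from the cited work, which used analog hardware.

```python
import math


def intervalgram_points(samples, sample_rate):
    """Return one (time_s, equivalent_frequency_hz) dot per zero-crossing interval."""
    crossings = []
    for i in range(1, len(samples)):
        a, b = samples[i - 1], samples[i]
        if (a < 0 <= b) or (a >= 0 > b):
            # Linear interpolation gives a sub-sample crossing position.
            crossings.append((i - 1) + a / (a - b))
    points = []
    for start, end in zip(crossings, crossings[1:]):
        if end <= start:
            continue  # skip degenerate intervals caused by exact zeros
        interval = (end - start) / sample_rate
        # A sinusoid crosses zero twice per period, so the sinusoid that
        # would produce this interval has frequency 1 / (2 * interval).
        points.append((start / sample_rate, 1.0 / (2.0 * interval)))
    return points


# Example input: a 100 Hz tone sampled at 8000 Hz (values chosen for illustration).
example_tone = [math.sin(2 * math.pi * 100 * i / 8000 + 0.3) for i in range(800)]
example_points = intervalgram_points(example_tone, 8000)
```

Applying the same function to the first difference of the signal would give the zero-slope version of the display.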

C. C. Bridges [1964] has produced a simpler linear time
zero-crossing  display  by  simply  plotting  the  zero-crossing  rate  as  a 
function  of  time  on  an  oscilloscope. 

The  main  justification  for  using  these  parameters  was  the 
finding by Licklider and Pollack [1948], Licklider [1959], and others,
that highly clipped speech signals, and highly clipped differentiated
speech  signals  were  still  quite  intelligible  to  the  human  ear.   Thus, 
since  these  clipped  signals  contain  only  interval  information  about  zero- 
crossings  or  zero-slopes,  a  display  of  this  information  should  contain 
all  the  essential  information  of  speech.   In  addition,  of  course,  these 
parameters  were  much  easier  to  obtain  than  spectrograms  or  correlation 
patterns.   However,  the  authors  were  unable  to  show  that  intervalgrams 
were  any  easier  to  read  although  Sakai  and  Doshita  [1963,  1968]  did  use 
this  technique  for  speech  analysis  and  recognition. 

Pyron  and  Williamson  [1965]  have  developed  an  x-y  display 
utilizing zero-crossing information in which they extracted the amplitude
envelope of the speech signal as well as the rate of zero-crossings
and the rate of zero-slopes. They experimented with plots of amplitude
vs. zero-crossings, zero-crossings vs. zero-slopes, and amplitude vs.
zero-slopes, but since they discovered that the latter gave consistently
clearer and more characteristic patterns, most of their results are
concerned with that form. As the authors noted in their report, Chang
[1951a]  has  provided  a  theoretical  analysis  and  experimental  evidence  to 
show  that,  in  speech  signals  with  a  pronounced  formant  structure,  the 
rate  of  zero-crossings  corresponds  to  the  first  speech  formant  while 
the  rate  of  zero-slopes  corresponds  to  the  second  speech  formant.   Thus, 
their  display  is  analogous  to  an  amplitude  envelope  vs.  second  formant 
x-y  display. 
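
The three parameters used in this kind of display can be approximated from sampled speech as in the sketch below. The frame-based computation, the use of the peak absolute value as the amplitude envelope, and the name `frame_features` are assumptions made for illustration; the original apparatus extracted these quantities with analog circuits.

```python
def frame_features(samples, sample_rate, frame_len):
    """Per frame: (peak amplitude, zero-crossing rate, zero-slope rate).

    Rates are in crossings per second. Zero-slopes are counted as
    zero-crossings of the first difference of the signal.
    """
    def crossings_per_second(values):
        count = sum(1 for a, b in zip(values, values[1:]) if (a < 0) != (b < 0))
        return count * sample_rate / len(values)

    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        slope = [b - a for a, b in zip(frame, frame[1:])]  # first difference
        features.append((
            max(abs(v) for v in frame),
            crossings_per_second(frame),
            crossings_per_second(slope),
        ))
    return features
```

Plotting the first element of each tuple against the third, frame by frame, gives a digital analogue of the amplitude vs. zero-slope form of the display.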

Ewing  and  Taylor  [1969]  have  duplicated  Pyron  and  Williamson's 
display  and  have  attempted  to  improve  upon  their  results.   They  initially 
worked  with  a  zero-crossing  vs.  zero-slope  type  of  display  with  the 
eventual  aim  of  generating  patterns  which  could  be  recognized  by 
computer. They also tried adding a time sweep to both axes which gave
the  display  a  diagonal  rise  across  the  face  of  the  CRT.   However,  they 
felt  their  most  promising  version  was  one  in  which  the  difference  between 
the  zero-crossing  and  zero-slope  signals  was  plotted  vs.  time.   In  this 
case  they  still  did  not  get  the  desired  results  but  they  felt  that  this 
was  due  to  poor  comparison  methods  during  the  recognition  phase  of  their 
procedure. 
3.7  Pitch  Extracting  Displays 

Another  type  of  processing  used  in  producing  speech  displays 
is pitch extraction. As early as the 1930's, Coyne [1938a, 1938b] and
Timberlake [1938] reported on a voice pitch indicator using 14 to 20
mechanical band-pass filters (i.e. tuning forks) with lamps which indicated
the pitch frequency. Its use in South African schools for the
deaf  has  shown  good  results  for  younger  subjects  but  negative  results 
for  older  subjects  with  settled  voice  habits. 

Dolansky  [1955]  has  described  a  pitch  extracting  device  based 
on a time domain analysis. The descendants of this device have been
used  to  produce  displays  which  have  been  used  in  several  experiments. 
These  displays  are  linear  time  displays  but  only  use  two  dimensions. 
Time  is  on  the  horizontal  axis  with  the  position  on  the  vertical  axis 
indicating  the  pitch  period  of  the  incoming  speech.   The  intensity  of  the 
display  is  turned  off  when  no  voicing  is  present,  but  other  than  this,  is 
independent  of  the  speech  input. 
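
The behaviour of such a display (a pitch period value per instant, with the trace blanked when there is no voicing) can be sketched as below. This is not Dolansky's extraction method, which is not detailed here: a short-time autocorrelation is swapped in as a simple time-domain stand-in, and the function name, the search range, and the voicing threshold are all assumed values.

```python
def pitch_track(samples, sample_rate, frame_len,
                min_hz=60.0, max_hz=400.0, voicing_threshold=0.5):
    """Return one pitch period (seconds) per frame, or None when unvoiced.

    None corresponds to the display intensity being turned off.
    """
    min_lag = int(sample_rate / max_hz)
    max_lag = min(int(sample_rate / min_hz), frame_len - 1)
    track = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(v * v for v in frame)
        best_lag, best_score = None, voicing_threshold
        if energy > 0.0:
            for lag in range(min_lag, max_lag + 1):
                # Autocorrelation at this lag, normalized by frame energy.
                corr = sum(frame[i] * frame[i + lag] for i in range(frame_len - lag))
                score = corr / energy
                if score > best_score:
                    best_lag, best_score = lag, score
        track.append(None if best_lag is None else best_lag / sample_rate)
    return track
```

Each non-None value would set the vertical position of the trace at the corresponding time, and a None value would leave that part of the time axis dark.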

F. Anderson [1960] has used a version of Dolansky's pitch
extractor utilizing a revolving CRT with a view panel, cut so that only a
portion is displayed on the vertical axis against a continuous horizontal
time base. The CRT uses a long-persistence phosphor so that the display
can be seen for five seconds. The display was used with a group of
eight children from ages 8 to 12, with hearing losses of 60 dB or more.
It  appears  to  have  been  somewhat  useful  although  the  author  did  not  go 
into  detail  about  it. 

The group headed by Dolansky at Northeastern University continued
to work on pitch displays (Dolansky, et al. [1965], Dolansky and
Phillips [1966], and Phillips, et al. [1968]). They performed several
studies using deaf children as subjects as well as normal hearing
university students. The results indicated that the display was of some
use in teaching deaf children and that it was possible to use the display
as a visual feedback indicator for speech pitch.

A  variety  of  other  researchers  have  developed  pitch  extraction 
displays (Gruenz and Schott [1949], Plant [1960], Martony [1968], and
others).   In  addition,  several  displays  have  been  made  which  incorporated 
a  pitch  display  along  with  some  other  type.   Stark's  spectrographic 
display  [1968],  mentioned  earlier,  uses  pitch  and  amplitude  as  well  as 
spectrographic  information.   Pickett  and  Constam  [1968]  describe  a  multi- 
display  device  developed  at  the  Hearing  and  Speech  Center  of  Gallaudet 
College  which  in  addition  to  being  able  to  produce  a  pitch  display,  could 
also  generate  vowel  spectrum  indications,  intensity  vs.  pitch  displays,  and 
intensity  contours. 
3.8  Miscellaneous Formats

In addition to the linear time and x-y display formats, a variety
of other formats have been tried. D. E. Williams [1967] has
designed a light bulb display which consists of a matrix of lights and an
electronic circuit to drive it, which analyzes the frequency content of an
utterance in 10 frequency regions and displays the results in a bar graph
form. There was also a second display which indicated the relative length
of time each frequency component was above a certain threshold. However, the
display  appears  to  be  valid  only  for  sustained  sounds  like  vowels  and 
even  in  these  cases  varies  tremendously  with  such  irrelevant  variables  as 
distance  from  the  microphone,  speech  rate,  etc. 
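
The two readouts of this light bulb display can be mimicked with a short calculation over per-frame band energies. The sketch below assumes that the bar height for each frequency region is its mean energy over the utterance; that choice, the input layout, and the name `band_displays` are illustrative assumptions rather than details of the original circuit.

```python
def band_displays(frames, threshold):
    """Compute the two displays from per-frame band energies.

    frames: list of frames, each a list with one energy value per
    frequency region (10 regions in the display described above).
    Returns (bar_heights, time_above), where time_above[b] is the
    fraction of frames in which region b exceeded the threshold.
    """
    n_frames = len(frames)
    n_bands = len(frames[0])
    bar_heights = [sum(f[b] for f in frames) / n_frames for b in range(n_bands)]
    time_above = [
        sum(1 for f in frames if f[b] > threshold) / n_frames
        for b in range(n_bands)
    ]
    return bar_heights, time_above
```

Because both outputs depend directly on absolute energy, this form also makes it easy to see why microphone distance and speech rate would change the display.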

Hubert  W.  Upton  has  developed  a  wearable  eyeglass  speechreading 
aid (Upton [1968], Pickett [1969], Risberg [1969]) which detects voicing,
friction,  stops,  etc.   Miniature  lights  imbedded  in  the  eyeglasses  glow 
whenever  the  corresponding  speech  feature  is  present.   The  device  was 
specifically  designed  as  an  aid  to  lipreading  and  therefore  the  speech 
features  which  were  chosen  were  those  not  visible  on  a  speaker's  lips. 
The  designer  noted  that  although  the  analyzing  functions  did  not  work 
perfectly,  the  device  still  gave  a  significant  amount  of  information  not 
obtainable  by  lip  reading  alone. 

In  addition  to  these  displays,  there  are  several  other  types 
which  have  been  in  use  by  speech  therapists  but  which  do  not  fit  neatly 
into  any  of  the  categories  mentioned  so  far.   Risberg  [1968]  discusses  a 
variety  of  these  devices  which  he  helped  to  develop,  including  various 
types  of  indicators  for  fricatives,  s-sounds,  intonation,  rhythm,  and 
nasalization.   Some  of  these  displays  might  be  called  linear  time  displays, 
but  others  involve  simply  meters  or  lights  which  turn  on  when  a  given 
threshold  has  been  reached  for  the  quantity  being  measured.   The  primary 
principle  was  to  minimize  the  number  of  functions  displayed  by  a  single 
device.   This  was  done  both  to  decrease  the  cost  and  to  isolate  the  speech 
feature  to  be  controlled. 
3.9  The  Use  of  Speech  Displays 

Although  a  wide  range  of  speech  displays  have  been  developed  over 
the  past  twenty  years,  there  has  been  a  reluctance  on  the  part  of  speech 
therapists to make widespread use of them. The reasons for this have
been briefly mentioned above and center on the cost of the devices and
the pedagogical problems which they produce. From a cost standpoint, it
is easy to see that the more complicated (and thus more costly) displays
would badly strain the small budgets of most schools for the deaf.
The pedagogical resistance might be a little harder to justify. However,
although it may be true that some of the resistance is simply a result
of innate conservatism on the part of teachers of the deaf, it is also
true  that  very  little  testing  has  been  performed  on  the  effectiveness  of 
the  various  display  types.   Thus  the  fears  of  these  teachers  toward  using 
untested  techniques  on  children  whose  futures  may  depend  on  them  are 
somewhat  justified. 

More  recently,  however,  the  situation  has  been  changing.   A 
variety of small experiments have been performed to determine the feasibility
of using particular displays as a visual feedback link to replace
the  auditory  feedback  link  which  has  been  destroyed  in  deaf  people.   The 
primary  goal  has  been  to  use  some  type  of  visual  display  to  indicate 
to  the  deaf  subject  how  correct  or  incorrect  his  pronunciation  actually 
is.   In  general,  these  studies  have  been  promising  for  younger  subjects, 
though  not  as  successful  for  older  subjects. 

The  tests  themselves  have  mostly  been  performed  by  specialists 
in  the  area  of  speech  training  and  have  involved  the  simpler  types  of 
displays.   Cost  is  the  obvious  reason  for  this  latter  fact.   This  same 
fact  also  makes  it  very  difficult  for  any  one  group  to  build  and  test 
more  than  one  or  two  displays  at  the  same  time.   As  a  result  there  has 
been  very  little  work  done  in  developing  general  testing  techniques  which 
could  be  applied  by  a  single  group  to  a  wide  variety  of  displays  in  order 
to determine the relative effectiveness of the different types. Happily
this trend appears to be reversing as can be seen by the previously
mentioned development of systems which can produce more than one display
type.

Although  many  groups  have  been  able  to  use  speech  displays  as 
feedback  aids  in  speech  correction  for  the  deaf,  the  original  goal  of 
the  Bell  Laboratory  group,  i.e.  actually  reading  the  display,  has  yet 
to  be  achieved.   It  has  in  fact  been  suggested  by  A.  M.  Liberman, 
et al. [1967a] of the Haskins Laboratory, that we may never be able to
perform this type of direct conversion. This is so, they maintain, because
there  is  no  simple  one-to-one  correspondence  between  the  characteristics 
of  the  speech  signal  and  the  phonemes  which  it  represents  (Liberman, 
et  al.  [1967b]).   Since  the  speech  signal  is  basically  a  complex  code 
as   opposed  to  a  simple  cipher,  the  phonemic  message  being  transmitted  is 
highly  restructured  at  the  level  of  sound.   As  a  result,  the  speech 
signal  characteristics  of  a  given  phonemic  unit  vary  greatly  according 
to  context. 

The  basic  biological  reason  for  the  recoding  is  the  fact  that 
both  the  ear  and  the  vocal  articulators  are  slow  speed  devices,  so  that 
in  order  to  deliver  information  at  a  higher  rate,  it  is  necessary  to 
operate   in  parallel  at  both  ends  of  the  communication  channel.   Thus  a 
given  speech  characteristic  will,  in  general,  give  information  about 
more  than  one  phoneme  and  a  given  phoneme  will  be  determined  by  more  than 
one  set  of  speech  characteristics. 

The key point to their argument against the readability of
visible speech, however, is their statement that although a decoder of
such signals obviously exists, it appears to be unalterably linked to the
auditory sensory system. Thus although it might be possible to create
displays  which  emphasize  the  important  key  features  of  the  speech  signal, 
it  does  not  appear  possible  to  produce  a  display  which  would  allow  the 
viewer  to  unconsciously  decode  the  signal  into  phonemes.   It  should  be 
noted  that  this  key  point,  by  the  authors'  admission,  appears  to  be 
true  only  because  in  20  years  of  experience  nobody  has  been  able  to 
learn  to  visually  decode  spectrograms  without  a  great  deal  of  conscious 
mental  effort. 

Recently,  however,  Lenneberg  [1967]  has  discussed  the  effect  of 
age  and  development  on  the  learning  of  a  language.   According  to  a 
variety of experiments it appears that the development of speech is impossible
once a human has reached approximately the age of puberty.
Before  this  time,  humans  are  capable  of  learning  language  even  if  large 
portions  of  the  brain  which  are  normally  connected  with  this  process  are 
destroyed  by  disease  or  accident.   The  brain  seems  to  be  very  plastic  at 
this  age  and  highly  adaptable. 

As  a  result,  it  may  be  possible  that  the  proposition  put  forth 
by  the  Haskins  group  will  only  hold  for  adults  since  their  brains  have 
already  "frozen"  into  a  permanent  state.   This  could  also  explain  why 
younger  subjects  seem  to  get  the  most  help  from  feedback  type  displays. 
It  would  be  interesting  to  try  to  teach  a  deaf  child  to  read  visible 
speech  since  in  this  case  the  child's  brain  might  actually  be  able  to 
adapt  itself  to  decoding  the  visible  input. 

Be  this  as  it  may,  if  we  grant  the  fact  that  the  human  eye 
cannot  be  trained  to  become  an  automatic  speech  decoder  (at  least  once 
the subject passes a certain age) then the task of using a visual display
as a speech feedback mechanism for adults can be looked at from
two positions. If the feedback device actually performs the decoding
before  presenting  the  visual  display,  then  it  becomes  in  effect,  a 
speech  recognizer.   This  is  precisely  what  the  last  20  years  of  speech 
research  has  been  trying  to  achieve  but  without  too  much  success.    In 
addition,  it  would  not  be  very  helpful  in  the  present  task  since  it  would 
not  be  giving  the  information  which  a  poor  speaker  needs  to  correct  his 
pronunciation. 

We  can,  on  the  other  hand,  ignore  the  absolute  decoding 
problem  and  instead  concentrate  on  displaying  the  most  relevant  speech 
parameters  in  a  concise  manner.   In  this  case  the  observer  would  not 
necessarily  be  able  to  recognize  the  words  merely  from  the  display. 
However,  if  the  proper  parameters  are  displayed  in  an  easily  discerned 
manner,  it  should  be  possible  for  the  observer  to  detect  the  differences 
between  his  pronunciation  and  a  comparison  display  of  the  same  speech 
pronounced  properly.   This  is  the  eventual  goal  which  has  been  set  up 
for  this  project. 


Chapter 4
PROPOSED  STUDY 

The eventual aim of this research is to develop a computer-driven
display system which can be used as a visual feedback link to
correct mispronunciations by people who are deaf, or in other situations,
such as language training, where corrective feedback on pronunciation may
be desirable. The envisioned system would present two displays to the
user, one of the word as it is supposed to be pronounced and one as it is
pronounced by the user. His task will be to determine if they are acceptably
close (this may be possible only after a certain amount of instruction
and practice) and if not, to determine which parts are in error and change
his pronunciation accordingly.

The  more  immediate  goal  of  this  particular  study  has  been  the 
development  of  a  generalized  computer  simulated  display  system.   This 
system  has  been  built  so  that  it  can  utilize  a  variety  of  speech  processing 
techniques  and  easily  produce  a  wide  range  of  speech  display  types.   In 
addition  several  of  these  displays  can  be  compared  with  one  another  to 
determine which of them is most effective in terms of presentation of relevant
variables and ease of training in their use.

The  speech  display  simulation  system  has  been  implemented  on 
the CDC 1604 installation at the Coordinated Science Laboratory at the
University  of  Illinois.   This  system  contains  a  high-resolution  variable 
intensity  CRT  display  equipped  with  facilities  for  taking  both  still  and 
moving  pictures. 

The  main  advantage  of  using  such  a  system  to  generate  speech 
displays  is  the  flexibility  inherent  in  a  computer  simulator.   Using  this 
type  of  system  it  is  extremely  easy,  once  the  basic  processing  programs 
have been written, to modify displays and to create new ones. There are
no time consuming hardware modifications to be made. Of course the main
disadvantage is the cost of the computer system. Once a suitable display
design has been found, however, a hardware version can be fabricated.
Alternatively, a time sharing educational system such as the PLATO system
might be used to allow access to a large scale computer at a minimum cost.
If low cost display units such as the plasma panel (see Bitzer, et al.
[1966] or Willson [1966]) become readily available and are capable of
producing the type of displays needed, this latter implementation might be
an inexpensive way of providing a variety of display types to the various
institutions needing them.

4.1  Outline of the Study

The development of this study was organized into the following steps:

1)  The  development  of  a  basic  subsystem  for  inputting  speech 
signals  into  the  computer.   Because  of  the  slowness  of  the  CDC  1604,  it 
was  not  possible  to  run  the  complete  speech  display  simulation  system  in 
real  time.   As  a  result  the  speech  input  subsystem  has  been  oriented 
around  the  tape  units  as  a  storage  medium.   The  I/O  programs  were  used  to 
read  in  data  from  an  audio  tape  recorder  attached  to  an  A  to  D  converter 
and  to  write  out  the  data  in  a  packed  format  on  magnetic  tape.   This  data 
was  edited  by  the  operator  by  means  of  various  data  manipulation  programs. 
Eventually the desired data was copied to a new tape complete with header
blocks describing the data. This edited data tape was then used as the
input  to  the  processing  routines. 

2)  The  development  of  various  speech  processing  routines 
general  enough  to  be  used  by  a  variety  of  display  types.   These  include 
such routines as peak detectors, zero crossing detectors, rectifiers, fast
Fourier transforms, digital filters, etc.

3)  The  development  of  the  speech  display  routines.   These 
routines  make  use  of  the  speech  processing  routines  plus  other  types  of 
general  routines  to  produce  specific  displays.   The  types  of  displays  are 
explained in detail in Section 5. Some were designed from descriptions
in the literature (see Section 3) and others were developed more or less
independently. As new speech processing routines were found to be necessary,
they were developed and added to the programs written in stage 2.

4) The production of display photographs for use in experimental
comparison tests. A limited number of words were picked and recordings
of these words correctly spoken by several people were made. After
being converted to digital tape, these recordings were processed by the
various routines to produce the desired displays. These displays were
photographed using a Polaroid camera and each resulting picture was
rephotographed to produce a 35 mm slide, which could be shown to subjects by
means of either a slide viewer or a projector.

5) Finally, two types of tests were conducted on each of
several types of displays to determine their relative effectiveness in
displaying speech. A preliminary test checked whether each display made
available the information needed for word discrimination. This test
was a type of concept attainment experiment in which the subjects had to
try to identify each word from its display. Its point was to determine
whether a given display type presents the proper information for word
identification. In other words, is the transformation
appropriate? Since it is fairly well established that this type of concept
identification task is hard (if not impossible) in the general
speech display case, this test was made using a limited number of words.

A final test to determine the displays' usefulness in a comparison
situation such as would exist in the eventual system was also conducted.
In this test the subjects were presented with pairs of photographs
which represented two different utterances as depicted by one of the display
types. The two utterances could be the same word spoken by two different
people, different words which sound similar, or a correctly and an
incorrectly pronounced version of the same word. The subject's task was to
determine if the two displays represented the same word. After his response
he was told the correct answer. As a further test the subject was occasionally
asked to indicate points of similarity or difference. Then on the
basis of the number and type of errors made on each display, a comparison
between the various display methods was made.
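
The bookkeeping for the final comparison test could look something like the following sketch, which tallies the number and type of errors per display type from a list of same/different judgments. The record layout, the two error labels, and the name `error_rates` are hypothetical; they are one simple way to organize such results and are not taken from the procedure actually used in this study.

```python
from collections import defaultdict


def error_rates(trials):
    """Summarize same/different judgments per display type.

    trials: iterable of (display_type, truly_same, answered_same) tuples.
    Returns a dict mapping each display type to its trial count, the two
    kinds of error, and the overall error rate.
    """
    counts = defaultdict(lambda: {"trials": 0, "misses": 0, "false_alarms": 0})
    for display, truly_same, answered_same in trials:
        entry = counts[display]
        entry["trials"] += 1
        if truly_same and not answered_same:
            entry["misses"] += 1        # same word judged as different
        elif answered_same and not truly_same:
            entry["false_alarms"] += 1  # different words judged as the same
    return {
        display: dict(
            entry,
            error_rate=(entry["misses"] + entry["false_alarms"]) / entry["trials"],
        )
        for display, entry in counts.items()
    }
```

Keeping the two error types separate matters here, since a display that hides real differences fails in a different way from one that exaggerates speaker-to-speaker variation.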

With  the  completion  of  the  comparison  tests  the  scope  of  the 
present  study  ended.   There  are  still  other  problems.   In  particular  the 
question  arises  that  even  if  the  subject  can  correctly  detect  a  difference 
between  two  displays,  he  may  not  know  how  to  change  his  pronunciation  to 
make  the  display  of  his  version  of  the  utterance  more  like  the  standard. 
However, in order to test out this problem a real-time display is essential.
Therefore  for  the  time  being,  this  problem  will  be  postponed. 

In  conclusion  the  goal  of  this  study  was  to  develop  several 
types  of  visual  speech  displays  and  then  perform  comparison  tests  on  them 
to  determine  their  relative  and  absolute  suitability  for  use  as  visual 
speech  feedback  devices. 
4.2  Theoretical Significance of the Comparison Tests

As was previously mentioned, the theoretical basis for the comparison
tests used in this study comes from that part of the psychological
literature, dealing with cognitive processes, which has come to be called
"concept identification" or "concept formation". The tests themselves
involve the establishment by the subject of various response categories,
i.e. the words, based on generalized concepts which must be developed by
looking at the various instances of these categories as depicted by the
particular display type being tested. In order to do this the subject must
select those attributes from the display instances which are most relevant
to the discrimination process and determine how these attributes indicate
the proper response categories.

Over  the  past  few  years  there  has  been  a  great  deal  of  discussion 
in  the  literature  of  concept  identification  about  the  exact  method  used  by 
subjects  in  the  development  of  concepts  in  this  type  of  situation.   Restle 
[1962],  Bruner,  Goodnow  and  Austin  [1962]  and  Haygood  and  Bourne  [1965] 
all  discuss  various  types  of  strategies  for  selecting  and  testing  hypotheses 
about  the  cues  which  will  lead  to  a  correct  classification.   Haygood  and 
Bourne  [1965]  break  the  process  down  into  two  problems:   finding  the  attri- 
butes of  the  various  instances  which  are  important  in  determining  the  con- 
cept (s)  and  finding  the  rules  involved  in  combining  the  values  of  these 
attributes.   The  attributes  may  vary  in  their  obviousness  and  the  rules  may 
be  either  simple,  i.e.  merely  the  presence  of  a  particular  value  of  the 
attribute, or complex, i.e. some logical relation between several attributes.

Bower and Trabasso [1963], in discussing two category problems
(i.e. the concept is simply the presence or absence of a particular value
of one of the attributes), develop an expression for the probability that
the subject will focus attention on the relevant attribute, namely:


$$P_r = \frac{W_a}{W_a + W_i}$$

where $W_a$ is the attention value of the relevant attribute, summarizing all
of the factors determining the subject's selection of it for testing, and
$W_i$ is the sum of these values for the irrelevant attributes. In a more
complicated  situation,  such  as  the  present  case  of  speech  displays,  there 
are  other  factors  to  be  considered  as  well. 
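
As a small numerical illustration, the expression can be read as the attention value of the relevant attribute divided by the total attention value of all attributes. The following Python sketch assumes exactly that ratio; the function name and the attention values are hypothetical and are not taken from Bower and Trabasso.

```python
def focus_probability(w_relevant, w_irrelevant):
    """Return W_a / (W_a + W_i) for one relevant attribute.

    w_relevant   -- attention value of the relevant attribute (W_a)
    w_irrelevant -- attention values of the irrelevant attributes (summed to W_i)
    """
    w_i = sum(w_irrelevant)
    return w_relevant / (w_relevant + w_i)


# Hypothetical attention values, chosen only for illustration.
print(focus_probability(2.0, [1.0, 1.0]))            # two irrelevant attributes
print(focus_probability(2.0, [1.0, 1.0, 2.0, 2.0]))  # four irrelevant attributes
```

Under this reading, adding irrelevant attributes to a display lowers the probability that the subject attends to the relevant one, which is the qualitative effect relied on in the rest of this section.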

In  the  first  place,  there  may  be  redundant  attributes  which 
would  help  to  establish  the  response  categories.   These  may  be  wholly 
redundant or they may be only partially redundant and thus only help in some
of the cases. Secondly the rules involved in combining the attributes are
probably more complex than the simple presence or absence of a particular
value of an attribute. Some of this complexity will be due to the inherent
complexity  of  the  speech  code,  and  some  of  it  will  be  due  to  the  partial 
redundancy  of  some  of  the  attributes.   Finally,  the  fact  that  we  are  work- 
ing in  a  multiple-response  category  situation  will  increase  the  complexity 
of  any  such  formulation. 

As a result of all of this, it would be very difficult to develop
any kind of precise mathematical formulation for the probability of achieving
concept  discrimination  in  the  present  case,  and  in  fact  this  is  not  really 
necessary.   All  we  actually  need  are  a  few  qualitative  predictions. 

Basically,  the  preliminary  test  is  meant  to  be  a  concept  forma- 
tion situation  in  which  the  subject  must  learn  to  identify  words  based  on 
the  cues  being  presented  by  the  display  type  being  tested.   It  is  hypothe- 
sized that  the  speed  with  which  the  subject  attains  the  "concepts"  of  the 
words  as  represented  by  that  particular  display  is  directly  related  to  the 
probability  of  concept  attainment  after  a  number  of  trials.   This  in  turn 
is hypothesized to be related to the number and effectiveness of the relevant
cues and inversely related to the number of irrelevant cues presented
by  the  display.   If  the  words  selected  for  display  are  sufficiently  typical 
of  the  normal  speech  sounds  encountered  in  spoken  language,  and  if  several 
speakers  are  used  to  get  a  typical  set  of  speaker  variations,  then,  provided 
that  there  is  a  difference  in  the  effectiveness  of  the  various  displays, 
the results should be significant. By measuring the length of time it takes
to achieve a given criterion of performance on a particular display, we
should obtain an indication of the relevance of that display type to the
problem  of  word  identification. 

The purpose of the second test was to determine, for each display,
the type of variations of the words which can be accepted as unimportant.
Since the second test takes place after the subject has gone through
the  first  test  phases,  the  subject  will  have  become  somewhat  proficient 
(hopefully)  at  understanding  the  display.   Thus  this  test  is  akin  to  a  con- 
cept discrimination  task  in  which  the  subject  is  trained  to  make  finer  and 
finer  distinctions. 


Chapter  5 
DISPLAY  DESCRIPTIONS 

The  purpose  of  this  section  is  to  give  a  detailed  description 
of  the  various  types  of  displays  which  have  been  produced  by  the  Speech 
Display  system.   Each  display  will  be  described  separately  in  general 
terms  along  with  the  different  variants  which  are  possible.   Photographs 
of  these  various  displays  will  also  be  given. 

Before  describing  the  speech  display  types  themselves,  it  will 
be  desirable  to  describe  the  two  main  display  packages  which  these 
speech  displays  use,  the  variable-intensity  TV  scan  display  and  the 
continuous  line  display. 

5.1  Variable-Intensity TV Scan Display

This  display  program  package  takes  a  two-dimensional  array 
of  intensity  points  and  produces  a  continuously  varying  intensity  display. 
The  programs  interpolate  between  the  points  in  the  array  in  both  dimensions 
and  set  up  a  TV  scan  display  buffer  which  is  plotted  and  photographed  by 
the  system  display  routines.   There  are  a  variety  of  display  choices  which 
can  be  specified  by  the  user: 

1)  The  number  of  points  to  be  interpolated  between  the  array 
entries  in  both  the  horizontal  and  vertical  directions. 

2)  The  distance  between  points  which  are  plotted  by  the  system 
display  routines  (this  will  affect  the  "grain"  of  the  resulting  display). 

3)  The  position  relative  to  the  left-hand  side  of  the  display 
at  which  the  actual  data  will  begin  to  be  displayed.   (This  allows  a  given 
speech display to be centered.)

4)  The minimum intensity below which the data will not be
displayed. (This helps to eliminate low-intensity clutter which takes
time to display but which adds no real information.)
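
The interpolation and minimum-intensity options can be illustrated with a short Python sketch. It is not the display package itself: bilinear interpolation is assumed (the text says only that the programs interpolate in both dimensions), the function and parameter names are hypothetical, and the plotting distance and left-hand offset options are omitted.

```python
def interpolate_grid(points, n_between, min_intensity=0.0):
    """Bilinearly interpolate a two-dimensional intensity array.

    n_between extra points are inserted between neighbouring entries in
    both directions; values below min_intensity are set to 0 (not shown).
    """
    step = n_between + 1
    rows, cols = len(points), len(points[0])
    out = []
    for r in range((rows - 1) * step + 1):
        r0, fr = divmod(r, step)
        r1 = min(r0 + 1, rows - 1)
        fr /= step                      # fractional position between rows
        row = []
        for c in range((cols - 1) * step + 1):
            c0, fc = divmod(c, step)
            c1 = min(c0 + 1, cols - 1)
            fc /= step                  # fractional position between columns
            top = points[r0][c0] * (1 - fc) + points[r0][c1] * fc
            bottom = points[r1][c0] * (1 - fc) + points[r1][c1] * fc
            value = top * (1 - fr) + bottom * fr
            row.append(value if value >= min_intensity else 0.0)
        out.append(row)
    return out


# Hypothetical 2 x 2 array, one interpolated point between entries.
for line in interpolate_grid([[0, 4], [8, 12]], n_between=1, min_intensity=3):
    print(line)
```

Raising n_between gives a smoother picture at the cost of a larger display buffer, while raising min_intensity removes more of the low-intensity clutter.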

5.2  Continuous Line Display

This  display  package  produces  a  continuous  line  display  using 
either  an  x  vs.  y  type  data  format  or  a  format  in  which  one  data  array 
is  plotted  sequentially  in  the  horizontal  direction.   The  main  display 
options  are  the  maximum  x  and  y  values  and  the  type  of  display. 

5.3  Spectrogram 

At the present time the spectrographic display is the most
versatile display in the sense that it can be varied in the greatest number of
ways.   As  described  in  section  3,  it  is  a  linear  time  display  in  which 
frequency  is  plotted  along  the  vertical  axis  and  time  along  the  horizontal 
axis.   The  intensity  of  the  display  at  any  given  point  is  proportional 
to  the  magnitude  of  the  particular  frequency  component  at  the  time 
represented  by  that  point. 

The  actual  frequency  analysis  is  done  using  a  Fast  Fourier  Trans- 
form program  initially  written  by  Gary  Horlick  of  the  Coordinated  Science 
Laboratory  and  subsequently  modified  by  the  author.   The  algorithm  is  a 
variant  of  the  original  Cooley-Tukey  algorithm  (see  for  example,  Cooley 
and Tukey [1965], Gentleman and Sande [1966], Cochran et al. [1967], or
Brigham and Morrow [1967]). More recently, Alan Oppenheim [1970] has
presented a very good article on the use of the FFT in producing spectrograms.

Since  the  FFT  is  a  discrete  transform,  the  output  frequency 
magnitudes  are,  in  effect,  samples  of  the  frequency  spectrum  of  the  data 
being  analyzed.  The  spacing  of  these  frequency  samples  is  determined  by 
the  fundamental  frequency  of  the  time  period  being  analyzed,  and  this  in 
turn is determined by the number of samples being processed. Thus it is
possible  to  decrease  the  spacing  between  frequency  samples  by  increasing 
the  number  of  time  samples  processed.  This  produces  a  more  detailed 
frequency  analysis  but  only  at  the  cost  of  having  a  larger  time  slice. 
This effectively means that although you have gained more information
about the frequency analysis, you are less sure about the position in
time  to  which  it  applies. 

In  addition  to  the  time-frequency  tradeoff,  it  is  also  possible 
to  adjust  the  number  of  frequency  components  to  be  displayed  and  thereby 
vary  the  total  frequency  spread  of  the  display. 
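
The tradeoff can be made concrete with a few lines of Python. This is only a sketch of the arithmetic: the sampling rate of 10000 samples per second is an assumed value, and the function and parameter names are hypothetical.

```python
def spectrogram_parameters(n_samples, n_displayed, sampling_rate_hz):
    """Return (frequency spacing in Hz, time slice length in s, spread in Hz)."""
    spacing_hz = sampling_rate_hz / n_samples     # fundamental of the time slice
    slice_seconds = n_samples / sampling_rate_hz  # duration covered by one slice
    spread_hz = n_displayed * spacing_hz          # total frequency range displayed
    return spacing_hz, slice_seconds, spread_hz


RATE = 10000.0  # assumed sampling rate, samples per second
for n in (64, 128, 256):
    print(n, spectrogram_parameters(n, n_displayed=32, sampling_rate_hz=RATE))
```

Doubling the number of time samples halves the spacing between frequency samples and doubles the length of the time slice, so for a fixed number of displayed components the total frequency spread is halved as well.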

Once  the  frequency  components  have  been  calculated  for  each 
time  slice  in  the  display,  a  linear  normalization  is  performed  on  the 
data  so  that  the  intensity  values  will  be  within  the  range  of  values 
used  by  the  CRT.   The  value  given  to  the  maximum  component  in  the  display 
can  be  adjusted  to  be  greater  than  the  maximum  intensity  which  can  be 
displayed  by  the  CRT.   Since  any  intensity  values  greater  than  the  maximum 
displayable  value  are  truncated  to  the  maximum  intensity  value  by  the 
display  routines,  this  allows  the  user  to  specify  a  value  range  over  which 
he  desires  truncation  of  the  intensity  values.   This  feature  is  valuable 
because  in  any  given  spectrogram  there  are  always  a  few  points  which  are 
way  out  of  line  with  the  rest  of  the  values  and  by  truncating  these  points 
the  remaining  points  can  be  given  a  greater  spread  of  values. 
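
A minimal Python sketch of this normalization follows. The CRT intensity range of 0 to 15 and the sample values are hypothetical, and the function is an illustration of the idea rather than the routine used in the system.

```python
def normalize_with_truncation(values, crt_max, peak_value):
    """Scale values linearly so the largest maps to peak_value, then clip.

    Setting peak_value above crt_max truncates the few largest points and
    spreads the remaining points over more of the displayable range.
    """
    largest = max(values)
    if largest <= 0:
        return [0.0 for _ in values]
    scale = peak_value / largest
    return [min(v * scale, float(crt_max)) for v in values]


display_values = [1.0, 2.0, 3.0, 20.0]   # one point far out of line
print(normalize_with_truncation(display_values, crt_max=15, peak_value=15))
print(normalize_with_truncation(display_values, crt_max=15, peak_value=60))
```

In the first call the outlier alone reaches full intensity and the other points are crowded near zero; in the second the outlier is truncated and the other points occupy a much wider range.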

A  second  form  of  contrast  enhancement  can  be  used,  namely  high 
frequency  emphasis.   This  simply  involves  multiplying  each  frequency 
component  in  every  time  slice  by  a  factor  greater  than  or  equal  to  1  and 
having  the  factor  increase  as  the  frequency  of  the  component  increases. 
In the actual program the emphasis begins at around 2000 Hz and the user
in  effect  controls  the  rate  of  increase  in  the  multiplicative  factor. 
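
High frequency emphasis of this kind might be sketched as follows. A factor that grows linearly above the starting frequency is an assumption, since the text states only that the factor is at least 1 and increases with frequency; the names are hypothetical.

```python
def high_frequency_emphasis(magnitudes, spacing_hz, rate_per_khz, start_hz=2000.0):
    """Multiply each component of one time slice by a factor >= 1.

    The factor is 1 up to start_hz and then grows by rate_per_khz for
    every 1000 Hz above it (a linear law, assumed for illustration).
    """
    emphasized = []
    for k, magnitude in enumerate(magnitudes):
        freq_hz = k * spacing_hz
        factor = 1.0 + rate_per_khz * max(0.0, freq_hz - start_hz) / 1000.0
        emphasized.append(magnitude * factor)
    return emphasized


# Five components spaced 1000 Hz apart, all of unit magnitude.
print(high_frequency_emphasis([1.0] * 5, spacing_hz=1000.0, rate_per_khz=0.5))
```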

In addition to these options, the spectrographic display can make
use  of  the  various  options  available  in  the  variable  intensity  display 
package.   Figures  1  through  3  show  various  examples  of  spectrographic 
displays  using  various  sets  of  parameters. 

Figure 1  Effect of Variations in High Frequency Emphasis
and Intensity Truncation Using the Word "Shod"
(panels combine different levels of emphasis with no, medium, and high truncation)

Figure 2  Effect of Variations in Time Slice Size

Figure 3  Examples of the Spectrographic Display
with Nominal Parameter Values

5.4  Formant Extracting Display

The  formant  extracting  display  is  similar  in  format  to  the 
spectrographic  display.   However,  in  this  type  of  display  the  formants 
are  extracted  from  the  display  data  and  all  other  display  data  in  the 
frequency  regions  of  the  formants  is  suppressed.   This  allows  the  formant 
movements to be seen more clearly and at the same time retains the high
frequency  fricative  information. 

The  formant  extracting  process  essentially  takes  the  frequency 
analysis  of  each  time  slice  and  finds  its  major  peaks.   This  involves 
utilizing a peak-picking routine twice (see figures 4 and 5).
The first pass over the frequency analysis data obtains the minor peaks
which represent the various harmonics of the fundamental pitch frequency.
The second pass of this data will obtain the peaks which can be considered
to  be  formant  candidates. 

The four largest formant candidates are then selected and analyzed.
If the smallest candidate is less than half the size of the next smallest
it is eliminated. Any candidates over 4000 cps are also eliminated since
it  is  unlikely  that  a  true  formant  would  appear  in  that  frequency  region. 

Once  the  unlikely  candidates  are  eliminated,  the  frequency 
analysis  in  the  region  covered  by  the  remaining  formants  is  erased  and 
replaced by the magnitudes of the formants at their corresponding frequencies.
The  results  of  this  type  of  analysis  are  shown  in  figure  6. 
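
The two-pass peak picking and the elimination rules can be sketched in Python as follows. This is one reading of the description and not the program used in the system: a peak is taken to be a value larger than both of its neighbours, the second pass is assumed to operate on the peaks found by the first, and the cutoff frequency is left as a parameter (max_hz, a hypothetical name). The final step of erasing and replacing the spectrum around each formant is omitted.

```python
def pick_peaks(bins, spectrum):
    """Return the entries of bins whose magnitude exceeds both neighbours in bins."""
    peaks = []
    for j in range(1, len(bins) - 1):
        here = spectrum[bins[j]]
        if here > spectrum[bins[j - 1]] and here > spectrum[bins[j + 1]]:
            peaks.append(bins[j])
    return peaks


def formant_candidates(spectrum, spacing_hz, max_hz=4000.0):
    """Return the bin numbers kept as formants for one time slice."""
    harmonics = pick_peaks(list(range(len(spectrum))), spectrum)  # first pass
    candidates = pick_peaks(harmonics, spectrum)                  # second pass
    # Keep the four largest candidates, largest first.
    candidates.sort(key=lambda k: spectrum[k], reverse=True)
    candidates = candidates[:4]
    # Drop the smallest if it is under half the size of the next smallest.
    if len(candidates) >= 2 and spectrum[candidates[-1]] < 0.5 * spectrum[candidates[-2]]:
        candidates.pop()
    # Drop anything above the cutoff frequency.
    return sorted(k for k in candidates if k * spacing_hz <= max_hz)


# Hypothetical slice: harmonics on the odd bins, three envelope maxima.
spectrum = [0.0] * 20
for k, value in zip(range(1, 20, 2), [1, 5, 2, 1, 6, 2, 1, 3, 1, 0.5]):
    spectrum[k] = value
print(formant_candidates(spectrum, spacing_hz=100.0))
```

Because a peak needs a neighbour on each side, the first and last harmonics can never be selected in this sketch; a practical routine would need some rule for the edges.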

It  should  be  noted  that  this  algorithm  never  determines  which 
formant  is  the  first,  which  is  the  second,  etc.   This  is  a  non-trivial 
problem  since  some  of  the  "formants"  selected  may  still  occasionally  be 
noise  and  quite  often  "real"  formants  will  drop  out  or  merge  for  several 
time slices. The only way to effectively determine the actual number
associated with the formant would be to keep a record of the movements
over time and on the basis of this record determine which peaks in a
given time slice correspond to each formant.

Figure 4  Effect of the Peak-Picking Process
on the Spectrum Analysis of a Single Time Slice

Figure 5  Effect of the Peak-Picking Process
on the Full Spectrographic Analysis of the Word "Dead"

Figure 6  Examples of the Formant Extracting Display

5.5  Zero-Crossing Display

The  zero-crossing  display  is  a  linear  time  display  in  which 
the  frequency  equivalent  to  the  zero-crossing  rate  is  plotted  on  the 
vertical  axis  and  time  on  the  horizontal  axis.   The  speech  input  is 
fed  to  four  digital  filters,  the  outputs  of  which  are  then  analyzed  to 
determine  their  zero-crossing  rates.   A  single  point  is  plotted  for  each 
filter  output,  the  magnitude  of  the  point  being  proportional  to  the 
magnitude  of  the  output  of  the  corresponding  filter. 

The  frequency  regions  have  been  chosen  so  as  to  approximate 
the  regions  covered  by  the  first,  second,  and  third  formant,  with  the 
fourth  region  being  a  high  frequency  region  for  fricatives  or  other 
noise-like  sounds. 
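
The conversion from a zero-crossing count to an equivalent frequency can be sketched in Python as below. The four digital filters are not reproduced; the function would be applied to the output of each filter in turn. The sampling rate and the test tone are assumed values.

```python
import math


def zero_crossing_frequency(samples, sampling_rate_hz):
    """Frequency (Hz) equivalent to the zero-crossing rate of one block of samples."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    duration = (len(samples) - 1) / sampling_rate_hz
    # A sinusoid crosses zero twice per cycle, so halve the crossing rate.
    return crossings / (2.0 * duration)


RATE = 10000.0  # assumed sampling rate, samples per second
tone = [math.sin(2 * math.pi * 500.0 * n / RATE + 0.1) for n in range(1001)]
print(zero_crossing_frequency(tone, RATE))  # close to 500 for this 500 Hz tone
```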

Examples  of  this  type  of  zero-crossing  display  are  shown  in 
figure  7. 

5.6  Zero-Crossing vs. Amplitude Envelope

This  display  is  a  simulation  of  the  display  described  by 
Pyron  and  Williamson  [1965].   There  are  actually  two  variants,  one  using 
the zero-crossing rate of the original speech signal, $Z_1$, and the other
using the zero-crossing rate of the derivative of the speech signal, $Z_2$.
(This  latter  signal  can  also  be  thought  of  as  the  "zero  slope"  or 
maximum-minimum  rate).   One  of  these  two  signals  is  plotted  against  the 
amplitude  envelope  of  the  speech  signal  to  produce  an  x-y  type  speech 
display. 

A  block  diagram  showing  the  production  of  these  two  display 
variants is shown in figure 8. Note that in producing the y input,
i.e., the $Z_1$ or $Z_2$ signal, the differentiator is omitted if a $Z_1$ signal
is to be used. The threshold detector is used as a blocking device to
inhibit the display when no signal is present.

Figure 7  Examples of the Zero-Crossing Display

The  display  itself  has  been  adjusted  so  that  the  zero-crossing 
rate  axis  varies  from  0  to  6000  zero-crossings  per  second  (equivalent  to 
0  to  3000  cps.)  as  was  described  in  Pyron  and  Williamson  [1965].  However, 
there  is  no  3KC  upper  limit  bandlimiting  as  was  true  in  their  case.  Thus 
occasionally  zero-crossing  rates  greater  than  a  3KC  equivalent  may  be 
encountered.   In  this  case  the  display  routine  truncates  the  display  in 
the  vertical  direction.   The  horizontal  axis  has  been  adjusted  so  that 
the  maximum  amplitude  coming  in  on  the  A  to  D  converter  will  cause  a  full- 
scale  deflection. 
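
One point of such an x-y display might be computed as in the following Python sketch. It is an illustration under several assumptions: the derivative is approximated by a first difference, the amplitude envelope is taken as the peak absolute value within a block of samples, and the function and parameter names are hypothetical.

```python
def first_difference(samples):
    """Discrete stand-in for the differentiator."""
    return [b - a for a, b in zip(samples, samples[1:])]


def count_zero_crossings(samples):
    return sum(1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0))


def xy_point(block, sampling_rate_hz, use_derivative=False,
             threshold=0.0, max_rate=6000.0):
    """Return (amplitude, zero-crossings per second) for one block of samples.

    Returns None when the amplitude does not exceed the threshold, which
    plays the part of the threshold detector that blanks the display.
    """
    amplitude = max(abs(s) for s in block)
    if amplitude <= threshold:
        return None
    signal = first_difference(block) if use_derivative else block
    rate = count_zero_crossings(signal) / (len(block) / sampling_rate_hz)
    return amplitude, min(rate, max_rate)   # truncate in the vertical direction


block = [0.0, 1.0] * 50                     # never negative, but full of maxima
print(xy_point(block, 10000.0))                       # signal itself: no crossings
print(xy_point(block, 10000.0, use_derivative=True))  # derivative: clipped at 6000
```

The example block shows why the two variants differ: a signal that never changes sign has a zero-crossing rate of zero, while its derivative changes sign at every maximum and minimum.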

Examples of the $Z_1$ and $Z_2$ vs. amplitude envelope displays are
given  in  Figures  9  and  10,  respectively. 

Figure 9  Examples of the $Z_1$ vs. Amplitude Envelope Display


Chapter 6
SPEECH  DISPLAY  SIMULATION  SYSTEM 

The  Speech  Display  Simulation  can  be  divided  into  four  main 
areas:   the  common  data  base,  the  command  processor,  the  speech  display 
routines,  and  the  various  subprocessing  routines. 

6.1  The Common Data Base

The  common  data  base  consists  of  the  input  speech  data  buffer, 
BUFF,  the  output  display  data  buffer,  FINT,  the  CRT  display  command 
buffers, ISCOPE and ISCOP1, and all of the constants and variables
used  to  control  these  buffers.   These  buffers  and  variables  are  all 
kept  in  COMMON  storage.   The  problem  of  keeping  the  COMMON  declaration 
in  each  subroutine  identical  is  handled  by  means  of  the  CSL  FORTRAN 
title  feature.   This  extension  of  the  FORTRAN  language  allows  the  pro- 
grammer to  specify  FORTRAN  statements  which  will  then  appear  in  every 
program  in  which  the  statement  TITLE*  appears.   Any  type  of  valid  FORTRAN 
statement  can  be  put  in  the  title  and  thus  the  whole  common  data  base 
need  only  be  written  down  once. 

The  common  data  base  has  several  key  features.   Since  the 
CDC 1604 was not fast enough to process speech input in real time, it
was  necessary  to  use  digital  tape  for  storing  the  input  speech  data. 
As  a  result,  it  became  unnecessary  to  provide  a  full-sized  buffer  to 
contain  a  complete  speech  utterance.   Instead,  the  floating  point 
buffer,  BUFF,  is  used  to  contain  only  that  portion  of  the  data  which 
is  of  current  interest. 

As can be seen in figure 11, there are two corresponding
pointers for the data tape and the buffer, BUFF. ISAMP is the main
data pointer and selects the initial sample of a set of data points
from the complete set of data (consisting of many speech utterances)
on the data tape. Its value may range up to around 900000, since this
is the approximate number of packed sample points which can be written
on a single tape. ISAMPB corresponds to ISAMP in that it points to the
same data as ISAMP but it refers to the data as it happens to be currently
loaded in BUFF. Thus ISAMPB only varies from 0 to the maximum length
of BUFF (currently 3000 words).

Figure 11  Relationship Between ISAMP and ISAMPB

The  display  generating  routines  are  free  to  move  ISAMP  up  and 
down  the  data  tape  whenever  they  wish.   Before  they  utilize  this  new 
data  position,  however,  they  must  call  the  subroutine  ADJUS2.   This 
subroutine  checks  BUFF,  and  if  the  data  corresponding  to  the  new 
position  of  ISAMP  is  not  currently  in  BUFF,  it  moves  the  tape  forward 
or  backward  until  it  can  load  BUFF  with  the  proper  data  and  converts 
it  to  floating  point.   Once  BUFF  is  made  to  contain  the  desired  data, 
ADJUS2  sets  ISAMPB  so  that  it  can  be  used  as  an  index  for  BUFF  to  obtain 
the  desired  data.   It  is  this  pointer  that  the  speech  processing  pro- 
grams use  to  obtain  the  speech  data. 
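
The relationship between the two pointers can be illustrated by the following Python sketch, in which a list stands in for the data tape. It is not the ADJUS2 subroutine: the class and method names are hypothetical, and the policy of reloading the buffer so that it starts at ISAMP is an assumption.

```python
class TapeWindow:
    """Keeps a fixed-size window (BUFF) over a long sample sequence (the tape)."""

    def __init__(self, tape, buff_len=3000):
        self.tape = tape          # stands in for the digital data tape
        self.buff_len = buff_len  # maximum length of BUFF
        self.buff = []            # BUFF: floating point copy of the window
        self.start = 0            # tape position of buff[0]
        self.isamp = 0            # ISAMP: position on the tape
        self.isampb = 0           # ISAMPB: same position as an index into BUFF

    def adjust(self):
        """Make sure the sample at ISAMP is in BUFF, then set ISAMPB."""
        inside = self.start <= self.isamp < self.start + len(self.buff)
        if not inside:
            self.start = self.isamp
            chunk = self.tape[self.start:self.start + self.buff_len]
            self.buff = [float(s) for s in chunk]   # convert to floating point
        self.isampb = self.isamp - self.start
        return self.isampb


window = TapeWindow(list(range(10000)))
window.isamp = 4500
window.adjust()
print(window.isampb, window.buff[window.isampb])
```

A display routine changes only ISAMP, calls the adjusting step, and thereafter indexes BUFF with ISAMPB; whether the tape actually had to be moved is hidden from it.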

The  second  feature  of  the  common  data  base  involves  the  FINT 
array.   This  array  is  basically  a  two-dimensional  array  containing 
intensity  values  with  its  dimensions  corresponding  to  frequency  vs. 
time.   However,  it  was  felt  that  it  would  be  much  more  convenient  to 
be  able  to  vary  the  relative  maximum  sizes  of  these  two  dimensions  even 
while  the  total  length  of  the  array  remains  fixed.   This  is  especially 
nice  for  short  speech  samples  in  which  it  is  desired  to  have  a  spectro- 
graphic  analysis  with  a  very  small  increment  between  frequencies,  since 
in  this  case  the  maximum  index  for  the  frequency  dimension  must  be 
increased.   Unfortunately  FORTRAN  has  no  provision  for  dynamically 
assigning array dimensions. Therefore it was decided to require each
program  using  FINT  to  calculate  its  own  subscripts  using  a  frequency 
maximum  index,  IFMAX,  which  could  be  dynamically  chosen  by  the  operator. 
At  first  this  seemed  like  a  lot  of  extra  work  but  the  technique  is 
relatively  straightforward  and  in  many  cases  it  resulted  in  a  consider- 
able increase  in  speed  due  to  the  lamentably  inefficient  calculations 
used  by  CSL  FORTRAN  to  calculate  subscripts.   This  was  especially  true 
in  loops  since  the  compiler  makes  no  optimizing  attempts. 
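
The subscript calculation can be sketched in Python as follows. The total array length of 6000 and the storage order (all frequency values of one time slice stored together) are assumptions made for illustration, and indices are counted from 0 rather than from 1 as in FORTRAN.

```python
FINT_LENGTH = 6000  # hypothetical total length of the fixed FINT array


def max_time_slices(ifmax, length=FINT_LENGTH):
    """Number of time slices that fit when each slice holds ifmax frequency values."""
    return length // ifmax


def fint_index(ifreq, itime, ifmax, length=FINT_LENGTH):
    """Flat index of frequency ifreq in time slice itime (both counted from 0)."""
    if not 0 <= ifreq < ifmax:
        raise IndexError("frequency index outside 0..IFMAX-1")
    if not 0 <= itime < max_time_slices(ifmax, length):
        raise IndexError("time index does not fit in the array")
    return itime * ifmax + ifreq


fint = [0.0] * FINT_LENGTH
fint[fint_index(3, 10, ifmax=64)] = 1.0    # coarse frequency axis, many time slices
fint[fint_index(3, 10, ifmax=256)] = 1.0   # fine frequency axis, fewer time slices
print(max_time_slices(64), max_time_slices(256))
```

Inside a loop over frequency the product of the time index and IFMAX is constant, so it can be computed once outside the loop, which is presumably the source of the speed increase described above.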

6.2  The Command Processor

The  command  processor  is  the  heart  of  the  interactive  communica- 
tion with  the  system.   It  gives  the  operator  the  ability  to  change  the 
values  of  the  system  constants  and  variables  and  to  call  the  various 
display  routines.   In  addition,  he  can  dump  out  the  contents  of  the 
various  arrays  and  variables.   The  command  processor  includes  the  main 
program  and  the  subroutines  directly  called  by  it,  namely  INPTCM,  which 
reads each command with its parameters, and the various command identifying
subroutines, which determine the command and perform the requested
operations.   At  the  present  time,  INPTCM  accepts  only  fixed  format 
commands.   However,  it  is  hoped  that  it  will  eventually  be  possible 
to  expand  it  to  a  free  format  subroutine. 

The  command  identifying  operations  have  been  kept  as  general 
as  possible.   The  commands  are  grouped  together  according  to  function 
into  subroutines.   Each  subroutine  has  the  task  of  identifying  those 
commands associated with it and then executing them. Since the subroutines
are independent of one another it is relatively easy to
expand  the  command  set  simply  by  adding  commands  to  the  relevant  sub- 
routine or  writing  a  completely  new  subroutine  and  adding  a  call  to  it 
in  the  main  program. 

The conventions for intercommunication are relatively simple
and  yet  allow  a  high  degree  of  flexibility.   Each  subroutine  accepts 
as  parameters  a  character  variable  containing  the  command  and  as  many 
of  the  input  parameters  read  in  by  INPTCM  as  may  be  necessary.   If  the 
subroutine  determines  that  the  command  is  not  one  for  which  it  is 
responsible,  it  simply  returns.   If  the  command  is  one  of  the  subset 
of commands which it can execute, it performs the required operations.
Then before returning, it sets the command variable to zero to indicate
to  the  main  program  that  the  command  was  executed.   Thus  after  calling 
all  of  the  command  identifying  subroutines,  the  main  program  merely  needs 
to  check  the  command  variable  for  zero  to  see  if  it  was  executed.   If 
it  is  not,  then  the  main  program  types  out  a  message  saying  that  the 
command  was  not  recognized. 
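
This convention can be sketched in Python as follows. A dictionary entry plays the part of the command variable, which FORTRAN subroutines would receive by reference. The command names are taken from Table 2, but the handler functions and the actions they perform are hypothetical simplifications.

```python
def tape_commands(state, nval):
    """Executes the tape commands it recognizes and ignores all others."""
    if state["command"] == "MOVE":
        state["isamp"] = nval            # move data pointer to NVAL
        state["command"] = 0             # zero signals that it was executed
    elif state["command"] == "LOCA":
        print("ISAMP =", state["isamp"])
        state["command"] = 0


def display_commands(state, nval):
    if state["command"] == "SPECT":
        state["displays"].append("spectrogram")
        state["command"] = 0


HANDLERS = [tape_commands, display_commands]


def execute(state, command, nval=0):
    """Main-program step: offer the command to every handler, then check for zero."""
    state["command"] = command
    for handler in HANDLERS:
        handler(state, nval)
    if state["command"] != 0:
        print("command not recognized:", state["command"])
        return False
    return True


state = {"command": 0, "isamp": 0, "displays": []}
execute(state, "MOVE", 4500)
execute(state, "LOCA")
execute(state, "XYZ")
```

Adding a new group of commands then amounts to writing one more handler and appending it to the list, which corresponds to adding one call in the main program.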

Note  that  this  technique  presents  a  wealth  of  opportunities. 
For  example,  a  command  identifying  program,  as  part  of  its  command 
execution  step,  could  load  the  command  variable  with  a  new  command 
instead  of  loading  it  with  zero.   This  command  could  then  be  executed 
by  some  subsequent  command  identifying  program.   This  in  fact  has  been 
done  in  the  present  system.   To  extend  the  idea  even  more,  the  command 
variable  could  be  generalized  to  a  push  down  stack.   Then  you  could 
have  complex  commands  which  actually  represent  a  series  of  simpler 
commands.   The  execution  of  the  complex  command  would  consist  of 
expanding  it  into  the  simpler  series  of  commands  and  pushing  these  into 
the  stack.   The  main  program  would  pop  the  stack  each  time  a  command 
was  completed  and  then  repeat  the  identification  and  execution  process 
for  the  newly  exposed  command  at  the  top  of  the  stack.   The  main  program 
would  only  return  when  the  stack  was  empty.   The  key  point  to  note  (and 
the  one  which  illustrates  the  general  philosophy  of  the  system)  is  that 
this stacking process could be added without modifying the programs
which  already  exist. 
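
A push down stack of this kind might look as follows in Python. This is a sketch of the proposed extension, not of anything in the present system: SNAP is a hypothetical complex command, and the handlers are simplified stand-ins. The first handler is written in the existing style and knows nothing about the stack.

```python
def display_commands(state, params):
    """Existing-style handler: executes simple commands, unaware of the stack."""
    if state["command"] in ("SPECT", "PHOTO"):
        state["log"].append(state["command"])
        state["command"] = 0


def macro_commands(state, params):
    """New handler: expands the hypothetical complex command SNAP."""
    if state["command"] == "SNAP":
        # Push in reverse order so that SPECT is popped and executed first.
        state["stack"].extend(["PHOTO", "SPECT"])
        state["command"] = 0


def run(command, params, handlers):
    """Main program: pop and execute commands until the stack is empty."""
    state = {"command": 0, "stack": [command], "log": []}
    while state["stack"]:
        state["command"] = state["stack"].pop()
        for handler in handlers:
            handler(state, params)
        if state["command"] != 0:
            state["log"].append("not recognized: %s" % state["command"])
    return state["log"]


print(run("SNAP", [], [display_commands, macro_commands]))
```

Only the main loop and the new handler refer to the stack, so the existing command identifying subroutines are left unchanged.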

Some  of  the  commands  which  can  be  executed  by  the  system  are 
given in Table 2. In addition to being able to run the various
display programs and diagnostic routines and to manipulate the data
tapes, the command system allows the operator to change many of the
system  variables.  This  allows  him  to  easily  modify  the  various 
displays.   It  also  causes  a  certain  number  of  problems  due  to  the 
manner  in  which  some  of  the  system  variables  and  constants  interact. 
An  example  of  this  problem  occurs  in  the  spectrographic  display,  where 
the  number  of  samples  to  be  processed  per  time  slice  fixes  the  interval 
between  frequency  coefficients  and  vice  versa. 

The  solution  to  this  problem  was  to  allow  the  user  to  set 
certain  parameters  independently  and  then  have  the  system  calculate 
the  effect  of  these  choices  on  the  other  dependent  parameters  and 
print  them  out  (this  operation  is  performed  by  the  FINI  subroutine). 
Thus,  for  example,  the  operator  can  choose  the  desired  number  of  data 
samples  he  wants  processed  per  time  slice  in  the  spectrographic  display 
and  the  system  will  respond  by  indicating  the  frequency  increment 
between  coefficients  and  the  total  frequency  range  which  will  be  dis- 
played given  the  current  value  of  IFREQ. 

6.3  The Speech Display Routines

The  speech  display  routines  consist  of  the  programs  used  to 
simulate  the  various  speech  displays.   These  programs  manipulate  the 
common  data  base  using  the  various  subprocessing  routines  to  produce 
the  displays  desired. 

Command   Subroutine         Operation

BEGN      TAPCOM             Rewind data tape & initialize system
BUFF      DIAGNG             Print out buffer contents
C         DIAGNG             Next input will be a comment
COPY      DATGCL             Copy data tape
DISP      DIAGNG             Display buffer contents on CRT
F         TAPCOM             Short form of FOWD = 1000
FINIS     PROSCL & DATGCL    Calculate dependent variables & turn off
FIND      TAPCOM             Search data tape for specified speech word
FORME     PROSCL             Call FORMEX display routine
FOWD      TAPCOM             Move data tape forward NVAL samples
HEADT     TAPCOM             Process header block
HIEMP     PROSCL             Add high frequency emphasis to display data
INITT     PROSCL & DATGCL    Initialize system variables
INTAP     DIAGNG             Assign input command medium
IWIDE     TAPCOM             Assign window size for data tape display
LOCA      TAPCOM             Print out value of data pointer
MOVE      TAPCOM             Move data pointer to NVAL
NORMF     PROSCL             Normalize display data
OBTAI     DATGCL             Use A to D converter to obtain speech data
PHOTO     PROSCL & DATGCL    Take picture of last display
PYRON     PROSCL             Call PYRON display routine
READF     DIAGNG             Read out display data stored on tape unit
REWIN     DIAGNG             Rewind tape unit NVAL
SAVEF     DIAGNG             Write display data on to tape unit 3
SPDIS     PROSCL             Display the display data array on the CRT
SPECT     PROSCL             Call SPECTO display routine
STAND     PROSCL             Produce a standard spectrographic display
THRSP     DATGCL             Call THRSPIC data processing routine
WHATN     PROSCL & DATGCL    Call WHATNOW subroutine
ZEROC     PROSCL             Call ZEROC display routine


Table 2  Commands Executed by Speech System

There are two basic formats for the output data. The three
dimensional  linear  time  displays  are  generally  represented  as  a  two- 
dimensional  FORTRAN  array  (stored  in  FINT)  with  each  element  containing 
a  quantity  representing  the  intensity  of  the  corresponding  point  on  the 
display.   The  display  routines  can  then  normalize  the  data  (performing 
such  operations  as  high  frequency  emphasis,  if  desired),  interpolate 
between data points, and produce a smoothly varying, multi-intensity
level  display. 

The  x-y  type  of  displays  are  represented  as  two  arrays  of 
the  corresponding  x  and  y  coordinates  of  successive  points  in  the  dis- 
play.  These  points  can  then  be  displayed  as  a  continuous  line  using 
other system display routines. In addition other variants can be
produced.   In  particular,  a  trivial  modification  of  the  above  display 
program  allows  a  single  variable  array  to  be  plotted  against  time 
(i.e.  successive  values  of  the  array  are  plotted  vs.  equidistant 
intervals  on  the  x  axis). 

6.4  The Subprocessing Routines

The  subprocessing  routines  consist  of  the  programs  which  are 
used  to  perform  various  operations  and  transformations  of  data.   Each 
routine  performs  a  single  type  of  operation  and  might  be  used  in  the 
construction  of  several  different  displays. 

In  order  to  insure  their  flexibility  of  use,  the  subprocessing 
routines  have  all  been  programmed  to  conform  to  a  certain  general  form. 
In  particular,  each  program  receives  as  its  input  a  data  array  and  a 
variable  indicating  the  number  of  points  to  be  processed.   The  output 
may  or  may  not  be  an  array.   If  it  is  an  array  and  if  the  output  array 
contains  the  same  number  of  points  as  the  input,  the  program  is  written 
so  that  the  same  array  can  function  as  both  input  and  output,  if  desired. 
If the number of points in the output array is different from the
number  of  input  points,  this  number  is  specified  as  an  output  parameter. 
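
The general form can be illustrated with two small Python routines. Both are hypothetical examples written for this illustration and are not routines of the system: the first produces as many points as it receives and may be run in place, and the second produces fewer points and reports the count as an output.

```python
def rectify(data, n, out):
    """Full-wave rectify the first n points of data into out (out may be data)."""
    for i in range(n):
        out[i] = abs(data[i])


def keep_every_other(data, n, out):
    """Copy every second point of data into out and return the output count.

    Each output position is written only after the corresponding input
    has been read, so out may be the same array as data.
    """
    n_out = 0
    for i in range(0, n, 2):
        out[n_out] = data[i]
        n_out += 1
    return n_out


buffer = [-1.0, 2.0, -3.0, 4.0]
rectify(buffer, 4, buffer)                   # same array as input and output
n_out = keep_every_other(buffer, 4, buffer)  # output count differs from input
print(buffer[:n_out])
```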

In  general,  all  intermediate  data  arrays  used  in  the  processing 
of  data  in  the  subprocessing  routines  are  specified  as  parameters.   This 
allows  the  calling  programs  to  have  complete  control  over  the  storage 
allocation  of  arrays  and  results  in  a  considerable  savings  in  space. 

In  order  to  avoid  the  variety  of  problems  created  by  passing 
subroutine  parameters  through  COMMON,  this  practice  was  generally  not 
used.   By  passing  all  of  the  parameters  explicitly,  the  routines  are 
easier  to  understand  and  have  many  fewer  mysterious  side  effects.   There 
are two exceptions to this rule, however. One is that certain system
constants, e.g. the sampling frequency, were allowed to be obtained
directly from COMMON. In general, the variables which are passed in
this  manner  are  those  whose  use  and  meaning  are  unlikely  to  change  as 
the  system  matures.   This  lowers  the  probability  of  having  to  rewrite 
the  subprocessing  routine  later  on.   The  second  exception  involves  short 
subroutines  which  are  used  very  often,   i.e.  in  "tight  loops".   In  such 
cases  the  overhead  involved  in  handling  explicit  parameters  becomes 
excessive  so  that  passage  through  the  COMMON  area  becomes  necessary. 

6.5  Basic System Principles

As the Speech Display Simulation System developed, certain key
principles emerged, as follows:

1) The common data base, command processor and speech display
routines should be basically machine independent. This means that
they should be written in standard FORTRAN as much as possible and any use
of CSL FORTRAN extensions should be fully documented by means of comment
statements in the code itself.


2)  The  subprocessing  routines  may  be  written  in  machine  language 
or  in  a  combination  of  FORTRAN  and  machine  language  as  is  allowed  in  the  CSL 
FORTRAN  system.   However,  this  should  only  be  done  if  a  significant  speedup 
in  time  or  savings  in  space  results  or  if  it  is  necessary  to  perform  some 
special  function,  such  as  communicating  with  the  CRT  display  unit.   In 
either  case  all  occurrences  of  machine  code  should  be  explained  both  in  the 
overall  sense  and  at  the  detailed  instruction  level  by  comments  within  the 
program. 

3)  Test  programs  used  to  check  out  the  various  subprocessing 
routines  are  not  normally  to  be  loaded  with  the  rest  of  the  system.   They 
are  kept  on  the  library  tape,  however,  so  that  when  needed,  they  may  be 
easily  loaded  by  making  a  call  request  to  the  CSL  Operating  System.   These 
programs  should  be  well  commented  with  exact  instructions  on  their  use  since 
it  is  easy  to  forget  their  operation  within  a  matter  of  weeks  if  they  are 
not  used  regularly. 

The  complete  descriptions  of  the  various  programs  used  in  the 
Speech Display system are given in Nordmann [1971] along with the program
listings,  test  programs  and  sample  outputs. 


Chapter  7 
RESULTS 

The basic simulation system has worked quite well and proved
quite adaptable as time went by. The major problem with the system at the
present time is the amount of inconvenience involved in producing a digital
data tape which can be used by the processing routines. Although the
recording and playback through the A to D converter is easy enough, the
decision about what to save and put on the permanent data tape must be made
on an individual basis by the operator. There are routines which can be
used to assist in this operation, such as THRSPIC, which will print out, for
each block on a tape, the number of samples above any particular threshold
value chosen by the operator. However, the basic decision as to where the
word starts and ends must be made by the operator. In a real time system
it would be possible to get around such a problem by using a push button
to indicate when the computer should "listen". At any rate, at the
present time, this task is somewhat tedious. Once it is accomplished,
however, the production of the displays is fairly simple.
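
The kind of per-block count that THRSPIC is described as printing can be
sketched as follows. This is an illustration in Python, not the original
routine: the function name, the block size, the sample values, and the
choice to compare the magnitude of each sample against the threshold are
all assumptions.

```python
def count_above_threshold(samples, block_size, threshold):
    """Return, for each block of the tape, the number of samples whose
    magnitude exceeds the threshold (magnitude comparison is an assumption)."""
    counts = []
    for start in range(0, len(samples), block_size):
        block = samples[start:start + block_size]
        counts.append(sum(1 for s in block if abs(s) > threshold))
    return counts


# Made-up samples: silence, a burst of speech, then silence again.
tape = [0, 1, -1, 0, 12, -30, 25, 9, -2, 1, 0, 0]
counts = count_above_threshold(tape, block_size=4, threshold=8)
```

A run of blocks with non-zero counts suggests where an utterance lies, but
the final choice of start and end points is still left to the operator.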

The testing of the displays turned out to present quite a few
problems, mostly revolving around the expense of a really comprehensive
testing procedure. In the end it was necessary to restrict the amount of
testing done, with the result that the tests which were performed cannot in
any way be considered definitive. However, several procedural variations were
tested and certain generalizations can be made about the restricted tests
which were performed.

In the end it was only possible to get two subjects who were
able to complete a full set of tests and even these subjects were not able
to run a full series on every display type. In addition several other
subjects completed various parts of the test series for specific display
types. As a result it is impossible to make any statistically significant
generalizations about the results and no type of statistical analysis was
even attempted. It is hoped, however, that the results will prove useful in
indicating the types of tests which might be useful in the future.

7.1  Recordings 

The first area which became restricted was the recorded data
itself. In order to minimize the number of utterances to be processed, the
test vocabulary was restricted to the 40 words listed in Table 3.
The words were chosen so as to give a distribution over the full range of
vowel sounds and at the same time allow maximum testing between words
differing by only a single phoneme. Four speakers were used, three female and
one male, to produce a total of 160 utterances. It was also intended to use
a set of recordings of the Modified Rhyme Test (see Kreul et al. [1968] or
Beyer et al. [1969]) produced by the Stanford Research Institute and
available from K-G Recording Service, 4311 Miranda Ave., Palo Alto, California.
These recordings were originally produced to be used in speech discrimination
tests but they were felt to be appropriate for the present purpose.
Unfortunately a variety of equipment difficulties, some of which were never
solved, prevented their conversion to digital tape. The result was that the
number of utterances available for the second type of test was not really
large enough.

The recordings of the 40 word list were produced in a quiet room
using untrained friends of the author as speakers. The equipment used
consisted of an Allied M3310 cardioid microphone attached to one channel of an
Allied T-1070 stereo tape recorder. The use of untrained speakers produced
one rather severe problem which was not discovered until several trial test
runs had been performed, namely that the words were not all enunciated clearly.

[Table 3: test vocabulary, word list as recovered]

shin     beet     dead     hag
sod      four     mob      guff
ted      sore     thin     shod
peat     nob      zed      hang
cuff     June     thor     cage
lynn     pang     vile     wage
said     loon     chuck    gin

This caused confusions between certain particular utterances by certain
speakers independent of the type of display used since the recordings
themselves were ambiguous. The effect of this problem will be discussed further
in the subsections concerning the actual test results.

7.2  Data from the First Test

As described in Section 4, the first test was intended to help
determine if it was possible to extract the necessary information from a
given type of display to identify different words consistently. It was
also intended to give a measure of the relative efficiency with which the
various display types performed this task by measuring the length of time
needed to reach a certain proficiency with the display.

The  test  items  for  the  first  test  were  selected  from  the  list 
of 40 words which were spoken by the 4 speakers. Two separate groups of
items were used; the first (test 1a) consisting of the words zed, said, vile,
file, dame, and tame and the second (test 1b) consisting of the words cuff,
guff,  mob,  knob,  shod,  sod,  ned  and  ted.   The  words  in  the  two  groups  were 
chosen  so  as  to  provide  pairs  of  words  which  might  be  easily  confused  if  the 
displays  were  not  in  fact  providing  the  proper  cues.   Unfortunately  with  the 
limited  amount  of  testing  which  could  be  done,  it  was  not  possible  to  test 
for  the  full  range  of  confusions  between  all  the  various  phonemes. 

The procedure for the first test involved showing the subject
slides of the displays produced by a particular display type and having the
subject try to determine which word was being displayed. When the subject
responded, he was told whether or not his response was correct and if not,
what word was actually being displayed. Initially the subject was allowed to
look for five minutes at a labelled sheet containing pictures of all the
slides in the test. Then the complete set was shown to him one at a time,
as many times as was necessary for it to be learned. During the test the
subject was allowed to use a written list containing the words in the group
being displayed.

Measurements were taken of the number of trial sets necessary
to reach the criterion level of response. This level was loosely defined
as the point at which the subject began to level off in improvement and
started making a more or less consistent set of mistakes. It was more
precisely specified as four consecutive trial sets in which the number
of responses did not vary by more than 10%. Tables 4, 5, and 6 give the
learning rates of each subject for the spectrographic, zero-crossing, and
formant extracting displays, respectively, in terms of the number of trial
sets necessary to reach the criterion run and the average percentage correct
during the criterion run.
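
One way to apply this criterion to a subject's record is sketched below in
Python. The sketch is an interpretation, not the procedure actually used:
it assumes that the quantity compared across trial sets is the percentage
of correct responses, that "did not vary by more than 10%" means a spread
of at most 10 percentage points within the four sets, and the scores shown
are made up for illustration.

```python
def find_criterion_run(percent_correct, window=4, tolerance=10.0):
    """Find the first run of `window` consecutive trial sets whose scores
    differ by no more than `tolerance` percentage points.

    Returns (number of trial sets preceding the run, mean score in the run),
    or None if the criterion is never reached.
    """
    for start in range(len(percent_correct) - window + 1):
        run = percent_correct[start:start + window]
        if max(run) - min(run) <= tolerance:
            return start, sum(run) / window
    return None


# Hypothetical learning record: percentage correct on successive trial sets.
scores = [40.0, 55.0, 70.0, 80.0, 85.0, 82.0, 88.0, 84.0]
result = find_criterion_run(scores)
```

For the record above the criterion run begins after three trial sets, with
an average of 83.75 percent correct during the run.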

Confusion  matrices  were  also  constructed  using  the  test  results. 
By  keeping  the  effects  of  the  various  speakers  separate  from  one  another, 
it  was  possible  to  determine  effects  which  might  be  due  to  a  single  speaker 
alone.   Tables  7  through  19  give  the  confusion  matrices  for  each  subject 
during  their  criterion  runs,  arranged  in  order  of  the  type  of  display.  Each 
box  in  each  matrix  has  room  for  five  numbers.   The  upper  and  lower  left 
hand  corners  contain  the  number  of  times  a  particular  response  was  given 
for  display  instances  of  the  particular  word  as  it  was  spoken  by  speakers 
a  and  b  respectively.   The  upper  and  lower  right  hand  corners  contain  the 
number  of  responses  for  instances  involving  speakers  c  and  d  respectively. 
The  number  in  the  center  position  is  simply  the  sum  of  the  numbers  in  the 
four  corners  and  represents  the  total  number  of  times  a  particular  response 
was given to a display instance representing the particular word irrespective
of  which  speaker  pronounced  it. 
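
The five-number layout of each box can be sketched in Python as follows.
The function name, the data structure, and the trial data are illustrative
assumptions; only the arrangement (one count per speaker plus their sum)
follows the description above.

```python
from collections import defaultdict

SPEAKERS = ("a", "b", "c", "d")


def build_confusion(trials):
    """trials: iterable of (word_shown, speaker, response) tuples.

    Returns a dict keyed by (word_shown, response). Each value holds one
    count per speaker (the four corners of a box) and their sum (the center).
    """
    cells = defaultdict(lambda: dict.fromkeys(SPEAKERS + ("total",), 0))
    for word, speaker, response in trials:
        cell = cells[(word, response)]
        cell[speaker] += 1
        cell["total"] += 1
    return dict(cells)


# Made-up responses, not data from the tests.
trials = [
    ("zed", "a", "zed"),
    ("zed", "b", "zed"),
    ("zed", "c", "said"),
    ("zed", "c", "said"),
    ("said", "d", "said"),
]
matrix = build_confusion(trials)
```

Keeping the per-speaker counts separate is what makes it possible to see
that a confusion such as "zed" for "said" comes from a single speaker.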

Table 4  Learning Rates for Spectrographic Display

Table 5  Learning Rates

As might be expected, the small number of subjects has caused
a great deal of confounding of data since many of the possible sources of
variance could not be balanced. In particular, the order in which a
subject learned the displays and the order in which the parts of the first
test were given could not be varied in such a way as to cancel any possible
variance which might be due to learning effects (across displays as well
as during the learning of a single display). As far as the learning data
is concerned, there is a contradiction between the two parts of the first
test which were performed on the different display types. The number of
sets necessary to reach the criterion run and the percentage correct
during the criterion run as recorded during test 1a would seem to indicate
that the spectrographic display was easier to learn than the zero-crossing
display. This same data on test 1b, however, tends to indicate the
opposite. The most likely explanation appears to be that the differences
in both cases are not large enough to be statistically significant given
the small amount of data available.

The confusion matrix data shows several interesting points. In
test 1a, there are very few confusions outside of the three basic word
pairs, i.e. zed-said, vile-file, and dame-tame. This could be attributed
to the fact that all three displays were able to satisfactorily distinguish
vowels. More probably, however, it is due to the fact that the word pairs
picked for this test have many differences among themselves and thus there
are many cues available with which to distinguish them. A much more
selective test could have been devised if the word pairs had been more similar
in their phonemic structure, e.g. if they all ended in the same phoneme
and used the same middle vowel.

A slight example of the type of results which this improvement
might produce is shown in the confusion matrices for test 1b for the
spectrographic and zero-crossing displays. Both subject A and subject B had
a certain amount of trouble distinguishing between the words "ned" and "mob"
in the spectrographic display (refer to tables 12 and 13). However, there
was no such problem with the zero-crossing display (see tables 16 and 17).

As  can  be  seen  from  the  various  confusion  matrices,  there  does 
appear  to  be  a  differential  effect  in  the  confusions  of  some  of  the  test 
words  based  on  which  speaker's  recordings  were  being  used,  e.g.  "zed"  in 
the  spectrographic  matrices  for  subjects  C,  D,  and  E  is  mistaken  for  "said" 
much  more  often  in  the  case  of  speaker  c  than  for  any  other  speaker.   It 
turned  out  that  in  the  original  recording  it  is  in  fact  rather  difficult  to 
determine  whether  the  word  is  a  "zed"  or  a  "said".   The  problem  comes  when 
we  note  that  subjects  A  and  B  did  not  make  this  confusion. 

This contradiction was eventually resolved by a comment one of
the subjects made in regard to another similar situation, namely the word
"vile" spoken by speaker b. This case was one in which there was a
definite problem of consistent classification, but the particular display
was so strikingly different that the subject simply put it in a class by
itself and after missing it once, he never misclassified it again. This
effect probably occurred in other cases as well and if common, it would
not only obscure the effects of poor pronunciation, but it would also tend
to invalidate the tests, since the subjects would be memorizing specific
instances instead of general identifying principles. The main way to
correct this problem would be to have more speakers and have several
different examples from each speaker. Then by having successive test sets
composed of different instances, the subjects would never be able to
memorize specific instances.

7.3  Data from the Second Test

The  second  test  was  intended  to  be  a  closer  approximation  to  the 
final  learning  situation  since  it  would  involve  a  comparison  between  two 
displays  shown  simultaneously.   Its  purpose  was  to  obtain  more  detailed 
data  on  the  effectiveness  of  the  displays  and  on  the  tolerances  which 
were  involved  in  each  type. 

Unfortunately,  the  Modified  Rhyme  Test  recordings  were 
found  to  be  defective  when  played  through  the  digital  conversion  apparatus 
of  the  display  system.   Thus  it  was  necessary  to  use  the  same  set  of 
recordings  as  in  the  first  test.   But  since  there  were  not  nearly  enough 
instances  in  these  recordings  for  a  complete  test,  only  a  single  test 
involving  comparisons  between  seven  words  was  attempted  ("zed",  "said", 
"ned",  "ted",  "sod",  "shod",  and  "dead"). 

The  actual  procedures  used  in  the  test  became  rather  complex 
due  to  some  of  the  external  restrictions  placed  on  the  experiment.   Due 
to  time  and  cost  constraints,  only  one  slide  of  each  display  instance  was 
available.   Thus  an  elaborate  scheme  had  to  be  worked  out  whereby  all 
possible  comparisons  of  different  instances  could  be  performed  with  a 
minimum  amount  of  slide  shuffling  between  the  two  projectors.   This  was 
done  by  dividing  the  slides  into  two  groups  and  working  out  all  possible 
comparisons  between  the  instances  in  the  two  groups.   In  order  to  keep  the 
expectations  of  the  two  possible  responses  ("same"  and  "different") 
equal,  it  was  necessary  that  approximately  half  of  the  matches  in  each  set 
be  the  same  word. 
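
The requirement that about half of the matches be the same word can be
illustrated with the Python sketch below. It is one simple way of building
a balanced comparison list from two slide groups and may well differ from
the scheme actually worked out for the test; the function name, the group
contents, and the use of random sampling are assumptions.

```python
import random


def balanced_pairs(group_one, group_two, seed=0):
    """Each instance is a (word, speaker) tuple. Returns a shuffled list of
    cross-group pairs containing equal numbers of same-word and
    different-word comparisons."""
    same, different = [], []
    for left in group_one:
        for right in group_two:
            if left[0] == right[0]:
                same.append((left, right))
            else:
                different.append((left, right))
    rng = random.Random(seed)          # fixed seed so the order is repeatable
    k = min(len(same), len(different))
    chosen = rng.sample(same, k) + rng.sample(different, k)
    rng.shuffle(chosen)
    return chosen


# Hypothetical division of a few slides between the two projectors.
group_one = [("zed", "a"), ("said", "a"), ("ted", "b")]
group_two = [("zed", "c"), ("said", "d"), ("ted", "c")]
pairs = balanced_pairs(group_one, group_two)
n_same = sum(1 for left, right in pairs if left[0] == right[0])
```

With the groups shown there are three same-word pairs and six different-word
pairs available, so the sketch keeps three of each.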

The various possible comparisons were written on small index
cards and shuffled to give a semi-random ordering. Then the two sets of
slides were placed in their respective projectors. One set was arranged so
that the experimenter could project the slides in any arbitrary order. The
other set was shuffled and then displayed one at a time by the subject in
sequence after the experimenter first noted down their order on a piece of
paper. As the subject projected each of his slides in turn, the
experimenter would pick the corresponding slide from his projector based on the
current index card notations being used and project it next to the first
slide. The subject would then respond "same" or "different", the
experimenter would answer "right" or "wrong", and then both would go to the next
slide pair. When the set was completed, the subject's slides would be
shuffled, the experimenter would select a new set of instances to match
and the process would repeat.

Since there were only single copies of each slide it was
necessary to rearrange the two display sets periodically to match other
combinations which could not be obtained using the previous set divisions. By
having two complete sets of slides this could be eliminated. It would also
be possible to have longer test runs between shufflings and to lower the
total number of runs necessary.

Tables 20 through 23 give the data recorded from the second test
for the spectrographic and zero-crossing displays, respectively. Only two
runs were made using this test and the data is shown in two forms: a
detailed matrix showing the results of the comparisons of the display
instances of particular speakers and a summary matrix showing the
proportion of "same" responses. In the detailed comparison matrix the speakers
are listed along the sides of the matrix for each word in the test. For
each instance pair tested a letter will appear at the respective
intersection in the matrix: a "d" if the response was "different", an "s" if
the response was "same". If no letter appears, the pair was not tested
and if more than one appears it was tested more than once. In the case
of pairs of instances representing the same word, the proportion of "same"
responses is given, since there were too many cases to write out a letter
for each one.

Table 20  Detailed Comparison Matrix for Subject A,
Test 2, Spectrographic Display

Table 21  Summary Comparison Matrix for Subject A

Table 22  Detailed Comparison Matrix for Subject A

The biggest problem with this test was the length of time it
took to perform it. Subject A worked a total time of about 6 hours on the
test for the spectrographic display and was still able to see only
approximately half of the total number of possible comparisons. A great deal
of the time was taken up in the procedural problems mentioned above and a
double set of slides would probably cut down the amount of time needed by
a significant factor. However, the fact remains that the procedure still
will take a great amount of time because of the large number of instance
pairs which must be tested.

The results from the comparison test show several interesting
features. In general, the mistakes appear to be made
on the same word pairs in both the zero-crossing and spectrographic
display types although the spectrographic display has a higher error rate in
almost every case. This would tend to indicate that although the subjects
have trouble on the same type of comparisons (at least as far as the words
which were tested are concerned), the zero-crossing display tends to allow
the subject to resolve the differences more accurately. (It should be
noted in regard to the problem of learning effects that subject A performed
the test for the zero-crossing display first.)

The detailed data from the comparison test agrees with the
results from the first test in certain respects. In the cases where the
same word spoken by two different speakers was presented, the subject tended
to make errors on the same instances as in the first test. This effect is
most noticeable in the case of "zed" spoken by speaker c. When this instance
was compared to either speaker a or speaker d's "zed" the subject had a
high error rate, but he made a perfect score when speaker b was compared
to a or c.


Chapter  8 
SUMMARY  AND  CONCLUSIONS 

The discussion of the experimental results can be broken down
into  two  main  areas:   a  discussion  of  the  tests  themselves  and  a  discussion 
of  the  general  ideas  behind  the  testing. 

8.1  Comments on the Tests Which Were Performed

Although  the  tests  which  were  performed  could  not  be  used  to 
establish  reliable  comparisons  between  the  various  display  types  due  to 
the  small  number  of  subjects  which  were  used,  they  did  indicate  several 
points  about  the  procedures  to  be  used. 

In  picking  out  the  words  to  be  used  in  the  test,  an  attempt  was 
made  to  select  a  variety  of  words  which  would  contain  all  of  the  common 
phonemes  (at  least  in  the  English  language).   In  order  to  minimize  the 
total  number  of  words,  however,  the  selection  was  restricted  in  such  a  way 
that  most  of  the  words  in  the  list  differed  from  one  another  in  several 
ways.   It  was  originally  felt  that  the  effects  of  single  phonemes  could  be 
determined from a multivariate analysis of data from the complete set of
words. Unfortunately, the amount of motivation and work necessary to
perform adequately on a test with 40 or 50 different word displays to
remember  is  much  more  than  the  average  subject  will  ever  have.   When 
smaller  subsets  of  the  word  list  are  used,  it  is  not  possible  to  control 
all  of  the  variance. 

Thus  one  basic  change  which  should  probably  be  made  in  the  word 
lists  is  to  use  nonsense  consonant-vowel-consonant  syllables  and  to  pick 
these  syllables  in  such  a  way  as  to  have  subsets  in  which  each  "word" 
differs  in  only  one  phoneme.   In  order  to  keep  the  total  number  of  words 
in  the  data  set  at  a  minimum,  it  will  be  necessary  to  restrict  the  number 
of vowels used in these subsets. The vowels could then be tested using
a  subset  in  which  only  they  vary. 
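
The construction of such subsets can be sketched in Python as follows. The
phoneme inventories and the base syllable are placeholders chosen only for
illustration (ordinary letters stand in for phonemes), and the function name
is hypothetical; no particular syllable list is being proposed.

```python
def minimal_subsets(initials, vowels, finals, base):
    """base is a (consonant, vowel, consonant) triple. Returns three subsets
    of nonsense syllables; within each subset only one position varies."""
    c1, v, c2 = base
    return {
        "initial": [c + v + c2 for c in initials],   # vowel and final fixed
        "vowel": [c1 + x + c2 for x in vowels],      # both consonants fixed
        "final": [c1 + v + c for c in finals],       # initial and vowel fixed
    }


subsets = minimal_subsets(
    initials=["b", "d", "g"],
    vowels=["a", "i", "u"],
    finals=["p", "t", "k"],
    base=("b", "a", "t"),
)
```

Here the consonant subsets use a single vowel, which keeps the data set
small, while the "vowel" subset is the one in which only the vowels vary.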

The  recordings  themselves  should  be  made  by  trained  speakers. 
In  the  current  tests  this  was  not  done  and  the  result  was  that  it  was 
sometimes  difficult  to  tell  whether  the  subject  confused  two  different 
examples  of  a  particular  word  because  the  display  was  ineffective  or 
because  the  audio  recordings  of  the  words  were  not  differentiable. 

In  addition  many  more  recordings  of  each  word  should  be  used. 
In  the  current  tests,  the  subjects  viewed  all  the  instances  during  each 
trial  set  and  from  the  comments  which  were  made,  it  was  apparent  that 
they  were  memorizing  specific  instances.   In  order  to  avoid  this,  it 
would  be  necessary  to  have  enough  instances  so  that  several  test  sets 
could  be  run  without  repeating  any  instances. 

8.2  Comments on the General Method

Above  and  beyond  any  purely  technical  problems  with  the  tests 
themselves  there  are  some  more  general  problems  involving  the  whole  idea 
behind the tests. As was mentioned in Section 4, there is the problem
of whether or not the subject can use a display to correct his
pronunciation even if he is able to detect that there is a difference between his
pronunciation and the standard. This can only be answered when a
real-time display system is developed and tests can be conducted on-line with
the system.

There  are  other  problems  as  well.   For  one  thing,  the  word 
identification  type  of  testing  which  was  used  in  this  experiment  is  not 
exactly the same type of situation as will be needed in the final use
of the system. It might very well be that the speech deformations
encountered in training the deaf or teaching the pronunciation of a
foreign language are qualitatively different from the differences between
the pronunciation of different words in a single language. In such a case,
the present type of testing may be inappropriate insofar as determining
the suitability of the various display types. This question can be resolved
by  using  the  appropriate  types  of  recordings  and  seeing  if  the  results  of  the 
tests  change  in  any  way. 

One other objection to this technique is the difficulty of
applying it to the specialized displays which are often used in speech
correction, such as pitch indicators, nasality indicators, etc. In principle
these  types  of  indicators  could  probably  be  tested  using  the  present 
techniques  and  the  displays  could  probably  be  generated  quite  easily  by 
the  system.   However,  in  the  case  of  this  type  of  display,  a  much  simpler 
testing  method  could  probably  be  devised. 

8.3  Summary

The  purpose  of  the  present  project  was  to  develop  a  computer  speech 
display  simulation  system  capable  of  generating  a  wide  variety  of  speech 
displays  from  a  recorded  speech  input.   Eventually  it  is  hoped  that  this 
will  lead  to  a  system  whereby  a  person  can  obtain  visual  feedback  as  a 
corrective measure for word pronunciation. The basic system would involve
two displays, one representing the subject's pronunciation of a particular
word and the other representing a correct pronunciation of the word. A
computer would be used to process the incoming speech and produce a display
containing features highly relevant to correct pronunciation. The
subject's task would be to detect differences in the two displays and to
change  his  pronunciation  so  as  to  make  them  more  similar. 

After  conducting  an  extensive  literature  search  to  determine  the 
types  of  schemes  which  had  previously  been  used  to  display  speech  sounds, 
a basic interactive display system was programmed using the CSL's CDC
1604 computer-graphics facility. The system has been designed to be
open-ended and currently can produce photographs of a variety of display types.
Unfortunately, the system as it stands now cannot operate in real time due
to the slowness of the CDC 1604.

The  simulation  system  was  used  to  produce  examples  of  several 
different  types  of  displays.   These  displays  were  used  in  a  series  of 
preliminary tests designed to develop techniques for comparing the
effectiveness of various types of displays. Several corrections and refinements
to  the  testing  methods  are  discussed. 